Quick poll to decide between two approaches to config / input cat I/O handling

barnabytprowe commented 12 years ago

Dear all,

In the MultiObjectDemo.py script being developed in #103 (and #101, see below), we are currently setting a bunch of object parameter values using a clever, home-grown (well, largely borrowed from Jim) data structure called an AttributeDict. I won't go into the details, but it means you can store info like:

    config.psf.type = 'Moffat'
    config.psf.beta.type = 'InputCatalog'
    config.psf.beta.col = 5

... In this case saying that we want the PSF to be a Moffat, and that it's beta value comes from the 5th column of an input ASCII catalog. The config structure is a set of nested dictionaries that manifest as object attributes. Between us we agree that this is a handy, minimal-syntax way to let users tell GalSim what to do. The sort of statements above could also be made in user configuration files, and Python can parse them easily using eval() and exec-related commands.

For a fuller example of this usage see examples/MultiObjectDemo.py, Script 2, in either branch #101 or #103.

Yesterday evening, @rmjarvis and I decided to try our independently-conceived approaches to parsing this information and using it to generate GSObject class instances (e.g. Moffats, Sersics) as per the user instructions. The broader discussion relating to this, and the decision to work in parallel, is to be found on Issue #101, from around 2/3 of the way down.

Mike and I are now going to post a short (one paragraph) description of our approach, and then we'll ask the masses (i.e. anyone with interest or some useful technical points to make) to offer a preference. Neither will remain fixed, and no work is lost: we can always revert. We just want to know what to go with for Milestone 2, and so we'll make a decision on this before the end of the day.

Thank you!

barnabytprowe commented 12 years ago

I should just add: both of us have written functions to do this called BuildGSObject() in galsim/frontend.py, with mine on #101 and Mike's on #103. Both are similar length solutions, and take similar time.

barnabytprowe commented 12 years ago

Barney's approach to frontend.BuildGSObject.py:

To be able to initialize GSObjects, the code needs to know what parameters are expected as the arguments (and/or keyword arguments) to the GSObject derived classes in base.py, since it was decided not to allow default initialization and later on-the-fly setting of these values. Moreover it needs to know which params are required, which are single parameter sizes (and thus need one but only one of a set of optional size params to be specified), and which are optional.

Therefore, I store this information in an object_param_dict dictionary in base.py, alongside the base classes themselves. Here it is, line 373-410: https://github.com/GalSim-developers/GalSim/blob/%23101/galsim/base.py

frontend.BuildGSObject() then uses this information (accessible from galsim.object_param_dict) to parse all of the information in a config structure of the sort described above: it is a single generic function that uses all of the info in this dictionary to control its behaviour, rather than being a series of custom-built, object-specific functions that get switched between.

I like this approach because:

It localises all of the structure into one place, one object, the object_param_dict in base.py.
Handily, for ongoing development, this is also right beside the GSObject derived classes themselves, and so any changes to that structure can be propagated by developers to the I/O handling by a simple modification in the same file. Unit testing is also very easy (initalize using the dict and catch TypeErrors or other Errors), so this can be fully checked against.
It occurred to me when writing the object_param_dict that it is also a useful piece of reference, and could be used to tabulate the GalSim inputs for a user-guide style piece of documentation, and could even be used to automatically update that documentation.

rmjarvis commented 12 years ago

The file to look at for my front end is here.

The basic idea (which is that same as for Barney's version, so you'll probably read this twice from different perspectives) is that each item in the config file that isn't just a simple value should have an attribute called type that defines what kind of thing it is.

For GSObject types, these would be Gaussian, Sersic, Exponential, etc. They are not required to match our class names, although most do. E.g. we have Sum rather than Add, since it seemed more natural in the context of the config file. And while we do have Pixel, we also have SquarePixel.

For things that are values, they can just be the value. Or they can have a type. So far the only type like this that either of us have implemented is InputCatalog which means to read the value from a catalog. But we plan to add things like GaussianDeviate, UniformDeviate, etc. (possibly with shorter type names) to describe how we want to do random shifts and such.

In either case, given a type, there are other required and/or optional attributes which define how to make the object or value. The main difference in Barney's and my approaches is how to parse these extra attributes in a clean, easily extensible way.

We both start with the BuildGSObject function as Barney said. Then my approach is to have a different separate function for each possible type. So I have

BuildGaussian
BuildExponential
BuildSum
BuildSquarePixel
...

And for the values, I have

Generate   # The starting point for all values with a type or not.
GenerateFromInputCatalog

with plans to add things like

GenerateFromGaussianDeviate
GenerateFromUniformDeviate
...

Then the fancy python trick to make this all easy is the use of the eval function: eval('Build' + config.type) or eval('GenerateFrom' + config.type) to get the right function name in each case. That's it.

(Phew, Barney didn't stick to one paragraph either.)

barnabytprowe commented 12 years ago

Thanks Mike! I think that's pretty clear.

OK, so now we turn to you guys for help. Compromise solutions / improvements between the two (I have thought of one possibility already but am biting my tongue) will not be accepted. We have only one day to go - choose the one which looks most like what you'd prefer, and we can continue this discussion and change it post Milestone.

Any and all of the @GalSim-developers/developers... Please give us your preference as to Barney's approach / Mike's approach.

Do not worry, we won't take it personally!

barnabytprowe commented 12 years ago

Hmmm, it's not letting me ping people, apologies if there suddenly ends up being a backlog of messages. I'll delete them, very sorry.

Trying again: in particular, those of you I know have good Python experience - @TallJimbo , @joezuntz , @joergdietrich , @pmelchior , @rearmstr , @reikonakajima - we'd love your opinion.

barnabytprowe commented 12 years ago

Would also welcome comments from more modest Pytho-maniacs!

rmandelb commented 12 years ago

I'm going to go look at what you guys wrote now. But - do you feel that being forced to choose one option at this point is unnatural? We opted to work in terms of Milestones in order to inspire effort, not to make us unnaturally choose between options that should be developed further in order to make an informed decision. (Of course, in some cases we've wanted to get something in place, deliberately assuming that that design will get scrapped in favor of something else later.)

barnabytprowe commented 12 years ago

I should have emphasised this more - we are just choosing what to merge into the master for this milestone. Everything in the code is up for debate at any time, and I hope always will be.

Anyway, there's still a bit to do for the demo scripts, and we need to focus efforts on getting shears and shifts working with whatever framework is adopted: not massive amounts of labour, but a little bit. So we do need to make a binary choice. Nothing is lost... It's all in the repo now anyway, for good.

rmandelb commented 12 years ago

Yes, of course. I only meant that if you think your effort would be better spent developing these further rather than picking one and doing shears, shifts, etc., then I would be okay with that. But there's a value to getting something in and working with it.

So far, Barney, I've just gone through yours, and I agree with your assessment of the value of this object_param_dict that compiles the information about the different base classes.

I'm off to look at Mike's, then I'll have something to say on the comparison...

rmjarvis commented 12 years ago

Rachel, I think we've both gotten far enough in the development to see any advantages or disadvantages to each, which was the point in doing the parallel development. Last night, it was hard for us to see how our different ideas would play out without a concrete prototype to look at. I think what we've done captures most of the interesting aspects of the two designs.

TallJimbo commented 12 years ago

After a very quick look, I'm voting for Barney's solution for now, just because it provides a clear, succinct view of the config schema.

Ultimately, I think we'll want to integrate the schema definition into the code that parses it so they can't become inconsistent, as Mike has done, but I think we need to take care to ensure the schema definition is still very readable when we do that, and at this point I think readability needs to take precedence over maintainability.

rmandelb commented 12 years ago

Thanks, Mike, for the clear description above.

I have a few thoughts on the distinction between a single object_param_dict vs. the many separate BuildWhatever functions.

From the perspective of a developer, I think that there is a slight benefit in the former approach. If someone is monkeying with one of the base classes, for example adding an optional keyword, then everything they need to change (class definition and the object_param_dict) is in base.py, whereas development in the base classes in the latter cases requires updates in base.py and frontend.py. However, I think we would all get used to this split relatively quickly, and even having everything in base.py does not prevent us occasionally forgetting to scroll down within the same file. :) So this benefit is slight. And I think the same argument applies to the value of object_param_dict in generating documentation -- in Mike's scheme, the relevant information is in a bunch of sequentially-defined functions in the same file, so it seems like not such a big difference. I'm not sure whether there's significant difference in ease of unit testing for the approaches?

Given that I think the above preference is slight, I will say what my gut reaction is: I prefer the simplicity of having everything in one object_param_dict... unless this is going to cause truly excessive complexity (or, restrict our ability to do what we eventually want to do with the code) somewhere down the line that I'm not thinking of? I agree that Mike's solution has simplicity in other places -- that each of the BuildWhatever is very simple and self-contained -- but... my gut reaction is to avoid having a ton of such functions. (We already have 2 layers with lots of stuff: SBProfiles and base classes - so this would be adding a 3rd layer on top with many separate functions that have to be maintained etc.)

As you can see I'm pretty waffly, so could be persuaded otherwise if you think I'm missing something. And if someone more informed in python than I am has a strong preference in the other direction, then we should go with that. I'm just not experienced enough in this language to envision possible issues with usage in future.

rmjarvis commented 12 years ago

I think the purported advantage of having Barney's object_param_dict encapsulate all of the structure is actually not true. For example, Sum, Convolve, SquarePixel. So you still need some of the structure encoded in frontend.py and some in the object_param_dict in base.py.

That's true, to be more accurate only all of the required structure / information regarding params to the non-compund GSObjects is stored in the dictionary.

However, Sum and Convolve are so far unique cases, and I don't think other compounds are being advocated by anybody. It's also very clear what they mean (notwithstanding that the Add() class has type Sum, which should be rectified.)

Likewise SquarePixel is a new invention as of yesterday that I would remove - SBProfile only supports square sampling spacings, so I don't see a physical case for non-square pixels. Any distortion can be put in the galaxy & PSF. We should link xw and yw into w and be done with it, in which case I'm happy for w to be a size (size seems like a scalar quantity to me, and keeping it so would tidy up other bits of both of our codes).

To understand Pixel for example, you need to look in the dict to see that there are no sizes specified. Rather xw and yw are both listed as required kwargs. Which is why he doesn't check that at least one size attribute is specified, since that would be wrong for Pixel. So he relies on the constructor eventually raising an exception if someone forgets to specify a size for Exponential, say.

I guess see above!

So I think having everything in one place is more an advantage of my version than Barney's. You can immediately see how all the configuration stuff works by looking through one file. And if you need to write something specialized, it is easy to notice that and to see what to do.

Even knowing very well what is going on, I have to say that looking through a file of semi-voodoo Python like we both have implemented in frontend.py does not seem easy to me: it requires a decent knowledge of Python syntax and the "standard library" commands to fathom what's going on in either case. I think that's a fairly high bar.

One other thought occurs to me: for public release we are going to need some form of an object_param_dict, whether it's only similar information in table form in a .pdf doc or actually used in the code. We also need to keep this updated and accurate, and for efficiency we'll probably want to test for the ongoing accuracy of this information. If it's helpful (~necessary) to have a reference to this information anyway in some form or another for public release, why not actually use it and help guarantee it's accuracy?

I'm saying no more, and although I've rushed to my proposition's defence I'm perfectly happy to adopt the BuildX scheme.

reikonakajima commented 12 years ago

I'm not a Python expert (though familiar enough), but I like Mike's better, in that the code appears simpler. There is benefit in Barney's table, though, as mentioned by others.

joergdietrich commented 12 years ago

I am voting for Barney's version. For the same reasons as Jim.

Jörg Dietrich jorgd@umich.edu http://www-personal.umich.edu/~jorgd/ Physics Department, University of Michigan, Tel. +1 734 615 4256 450 Church Street Fax +1 734 763 9694 Ann Arbor, MI 48109-1040, USA

pmelchior commented 12 years ago

I prefer Barney's solution. The reason is that is comes in one piece. I believe, Mike's solution might be somewhat easier to maintain in the long run, but the limitation of Barney's approach are not severe enough for me to justify the additional complexity of Mike's.

joezuntz commented 12 years ago

I prefer 103, as it seems more explicit. I think it could probably be simplified, though.

rmandelb commented 12 years ago

Looks like we have a decision, so I'm closing this. Can reopen for more discussion if needed...

barnabytprowe commented 12 years ago

This issue closed following proposals on #101, thanks for the input everybody. This will be an important issue that we'll return to so as to improve our solution, I am 100% certain of that!

GalSim-developers / GalSim

Quick poll to decide between two approaches to config / input cat I/O handling #143