Mechanisms for more restricted config files

TallJimbo commented 12 years ago

We've made good progress prototyping our configuration files using the AttributeDict class on other issues (#101, #103), but eventually we'll probably want to use something that makes it easier to catch user errors and define what all the available configuration options are.

I plan to use LSST's pex_config package as a starting point, but I don't think we need all the features it has and the complexity they add.

I'm assigning this to myself to start with, as I've already thought about how to do this quite a bit when we were designing pex_config. Once the basic skeleton is in place I'll probably want to pass it off to others for refinement if other people like the approach.

rmjarvis commented 12 years ago

Just to get this out in the public record rather than sitting in my head, here are some other items that still need to be done for the config parsing (most of which I marked as TODO somewhere):

Add more ways to specify a WCS. e.g. the CD matrix. Probably don't need a full gamut of things like TAN and TNX yet, but maybe down the line.
Check that the way I implement the WCS shear is accurate. I'm not sure how best to do this, but I'm not overly confident that I did it right.
I currently hard-code the noise = sky_variance/100 parameter for drawShoot. That is, if you've shot enough photons that the shooting noise drops to this value, go ahead and stop (even if it's not at nphotons = flux yet). This means that after adding the sky noise, the galaxy part of the image will have up to 1% extra noise than it should have. This seems a reasonable choice, but it might be worth making this a parameter that the user can specify.
I thought about trying to add a way to include an arithmetic calculation in the yaml script. An example is lam_over_D in BD3.yaml. In the original, we calculated this from lambda and tel_diam. It might be nice to have a way to do this in the config files, but yaml doesn't support any arithmetic (or any evaluated code for that matter) natively, so if we want to do this it would have to be through some special parsing. e.g. type = Eval, str = '...'. But to be worthwhile, you'd have to be able to access other values that had already been specified, and how to do that wasn't obvious.
Sometimes I think we'll want to add the psf image to a second HDU in the same file as the galaxy images. For example, this is how Erin likes to make his test images. Probably also a weight variance image, and eventually a bad pixel mask. I didn't yet implement how to do this.
There are a bunch of places where I read a value in from the config file just as a value, rather than letting it get parsed by the mechanism that looks for a dict with a type. I should just make it a policy to have every value read in that way, even if I can't imagine a use for it. Somebody probably will eventually, so better just to have the parsing mechanism in place.
I already mentioned that I need to add the ability to make a bunch of files rather than just one. Part of making this work is to add a way to generate file_names in the usual way of incrementing a numeric portion (mytests0001.fits, mytests0002.fits, etc.).
Also, when making multiple files, the multi-processing will proceed more efficiently if we multi-process at that stage rather than on each image. There is some overhead from pickling the images to pass back to the initial process after they are drawn (the downside of using multiprocessing rather than multithreading). I thought to enable this by having nproc in output rather than image mean that this is where the processing split should happen.
Some of the config reading is not really multiprocess-safe. e.g. Things that are generated with a sequence will potentially be done out of order by multiprocessing, so the sequence values will be wrong. One way I thought to fix this is to do all the config parsing in the main thread and then pass the profile to be drawn to each sub-process, which would actually draw the image (the slow part of the code, so what we want to multi-process). However, this requires issue #218.
Related to the above, I suspect multiprocessing is currently very inefficient using RealGalaxy, since the whole real_galaxy_catalog needs to be passed into each process each time. If we switch the config parsing to the main process, this would become manageable (just the two SBInterpolatedImages for that particular real galaxy would be passed in, which would be ok.)
We currently only have ASCII catalogs. We still need to add in FITS catalogs.
I didn't add the Eta options for shear. Just an oversight. Eta1Eta2 and EtaBeta should probably be added.
My dilation specification matches applyDilation which uses a scale rather than mu. Might want to add a way to specify mu instead? OTOH, this would be unnecessary if we enable arithmetic, since then someone could do exp(mu) themselves.

rmandelb commented 12 years ago

Responding to your first message:

1) yes, the truth table is exactly what I meant 2) I'm not sure how useful this would be for config users, who are presumably less comfortable with python? 3) I didn't quite mean typos, I meant things like comments in frontend.py that certain functionality isn't supported, when it does seem to be supported and used in the MOD*.yaml scripts. If you're still updating functionality then I thought those comments might be in a state of flux. But I can fix those directly if it won't cause problems for you. 4) RandomCircularTopHat = a painful mouthful. Ugh. 5) I'm not sure what you're referring to here, sorry.

Re: your second message, we should think about which of these should be done ASAP and which could be deferred until after we've gotten some people to play w/ the code.

rmjarvis commented 12 years ago

5) I'm not sure what you're referring to here, sorry.

shift :
    type : DXDY
    dx : { type : Random , min : -1.0 , max : 1.0 }
    dy : { type : Random , min : -1.0 , max : 1.0 }

rmjarvis commented 12 years ago

3) I can fix those directly if it won't cause problems for you.

Thank you. It won't cause any problems. :)

rmjarvis commented 12 years ago

we should think about which of these should be done ASAP

I think ASAP is probably writing up a use guide for how to use the config files. Assuming everyone is comfortable with the general structure of the config files as I've written them, I'll go ahead and take a stab at this.

And I'll include the user-relevant TODO items, but mark them as not implemented yet. Then users can tell use which things they want us to implement sooner than later.

rmjarvis commented 12 years ago

2) I'm not sure how useful this [ output.type = Array ] would be for config users, who are presumably less comfortable with python?

This came up in #164 when you suggested writing scripts that could do things like ring tests. I thought it might be more convenient even for python users to just set up a config dict with the relevant items and let our config processing make the ring test images as specified. But a pythonista might want to keep the images internally to run their own shear measurement code on them, rather than having to write to disk and then read them back in.

rmandelb commented 12 years ago

5) oh, I see what you mean, we don't need a 2d top hat that is a product of 2 1d ones because they can be specified individually.

3) will do, this weekend

2) yes, for direct python users I agree that getting an array of images out would be useful. I guess I'm not sure in what mode people will work, assuming 3 options: (a) write python scripts directly, in which case they can just set up their array of images themselves, (b) use config without writing much/any python, or (c) work in python but using config tests for scripting of multi-object simulations. I had been thinking that regular python users would use case (a) in which case having config output arrays of images isn't necessary, but if they do (c) then it would be helpful. Possibly we could, for now, flag this as something possible for the future if people find it useful? And we could grill the people who try out GalSim to get a sense for whether there's demand for this?

As for the ASAP question: good point, I think that would be helpful. I think Barney has started work on some of the user's guide-type documents, but not on anything config-related since it was still in-progress. So perhaps you could write up the config-related part of that document, if you're willing? @barnabytprowe , does this sound like a good plan to you?

In that case, perhaps we should decide if there is anything that MUST get into config before we start encouraging people to play with it. My sense is that the answer is no... config is good enough to do what's in our demo scripts, which is the majority of the capabilities of GalSim, and all the other details can be saved for later (perhaps open another issue lumping everything together, since most of those bits are a relatively small amount of coding).

barnabytprowe commented 12 years ago

Hi Mike,

Initial thoughts: wow, this is fantastic. BD1-3.yaml and MOD1-4.yaml all look great, and very clear; they're almost a users' guide to themselves :)

A couple of comments:

So please take a look and see what you think. I've only really added the functionality that was necessary to get these particular scripts working, so I know there are more features we will want to be able to be specify in config files (e.g. multiple output files), but they aren't enabled yet.

I agree with Rachel that we'd thought about maybe putting in some more of the RandomDeviates, and I'm happy to take a look at that myself. Otherwise, I also agree that we shouldn't try to spend too much time trying to second guess what people will want to use in advance. What you've done is a fine demonstration of what's possible, so it will no doubt spur people to suggest things!

Also, the point of this issue is "Mechanisms for more restricted config files". Now that I've enabled at least most of the functionality we'll need, it should be a bit easier to try to determine how much we want to restrict. Right now, I raise exceptions whenever a required item is missing, but I think one of the ideas was to catch when people specify an illegal item, since it is likely a typo. Currently, that item just gets ignored. So that would be a useful addition if we can think of a nice way to do so.

Now that the objects have their params specifed in the _params attribute, this is indeed partially covered... only if the parameter is optional will it be ignored without error if mis-spelled / mis-specified. It would be nice, and it wouldn't be hard to to cycle through the optionals to see if there are any mis-spelled, non-essential parameters being specified that would otherwise be ignored. Not sure how we'd do it for noise deviates or other objects, unless we just adopt the plan of passing them over and wait for a TypeError.

Extra outputs aside from the image files would definitely be useful (e.g. the truth table). I'd add this as output.catalog, similar to input.catalog.

Yep, good idea!

At one point I also thought of having the list of images returned internally as a python list of images. This could be an output.type = Array or Internal.

Inititally thought, hmmm yeah that would be cool, but Rachel makes a good point about non-Python fans being the most likely config users. Can see it not being a priority therefore.

If you see typos, feel free to fix. Thank you. I wasn't planning on working on this more in the next few days.

Will do also.

I'm fine with changing the name of RandomTopHat, since it's apparently confusing. Maybe RandomCircle or RandomCircularTopHat (... other ideas?).

RandomUnitCircle?

barnabytprowe commented 12 years ago

I think ASAP is probably writing up a use guide for how to use the config files. Assuming everyone is comfortable with the general structure of the config files as I've written them, I'll go ahead and take a stab at this.

I'm happy.

And I'll include the user-relevant TODO items, but mark them as not implemented yet. Then users can tell use which things they want us to implement sooner than later.

Excellent idea!

As for the ASAP question: good point, I think that would be helpful. I think Barney has started work on some of the user's guide-type documents, but not on anything config-related since it was still in-progress. So perhaps you could write up the config-related part of that document, if you're willing? @barnabytprowe , does this sound like a good plan to you?

That does sound good, although I'm duty bound to point out as a point of information that I am yet to start the brief user's guide/cheat sheet style document for which I opened an Issue very recently in #236. Hoping to get onto it this weekend! The only question for me is as to whether this guide would be better in the slightly longer user's guide for which we haven't yet got an issue, but which we are also planning very soon. I think it might be better in the longer guide.

In that case, perhaps we should decide if there is anything that MUST get into config before we start encouraging people to play with it. My sense is that the answer is no... config is good enough to do what's in our demo scripts, which is the majority of the capabilities of GalSim, and all the other details can be saved for later (perhaps open another issue lumping everything together, since most of those bits are a relatively small amount of coding).

I would agree, which is fantastic :) :) :)

rmandelb commented 12 years ago

As for the ASAP question: good point, I think that would be helpful. I think Barney has started work on some of the user's guide-type documents, but not on anything config-related since it was still in-progress. So perhaps you could write up the config-related part of that document, if you're willing? @barnabytprowe , does this sound like a good plan to you?

That does sound good, although I'm duty bound to point out as a point of information that I am yet to start the brief user's guide/cheat sheet style document for which I opened an Issue very recently in #236. Hoping to get onto it this weekend! The only question for me is as to whether this guide would be better in the slightly longer user's guide for which we haven't yet got an issue, but which we are also planning very soon. I think it might be better in the longer guide

Good point. I was kind of conflating the two, but I think you're definitely right. The brief guide / cheat sheet would probably only go as far as saying "you can specify simulation parameters using config files, see the examples in example/*yaml or the full user's guide for more detail."

But, the longer guide also needs to get done (at least in SOME form) in the near term, so if Mike is willing to write the config-related bits, that would be super helpful.

rmjarvis commented 12 years ago

only if the parameter is optional will it be ignored without error if mis-spelled / mis-specified

That's right. This is the functionality I thought should be added. (And I think the new parameter dict format makes this relatively easy to implement!) I noticed this because you originally forgot to put obscuration in the OpticalPSF list of parameters. But it didn't give me an error for specifying it in the config file. It just silently made the wrong profile (with obscuration = 0). It wasn't too hard to figure out where the error was, but it would have been nicer to have it tell me that it wasn't expecting to see the obscuration item.

rmjarvis commented 12 years ago

RandomUnitCircle?

I do allow a radius, so "unit" is probably not accurate. I'm leaning toward RandomCircle so far, but still open to other options.

rmandelb commented 12 years ago

+1 for RandomCircle

On Aug 31, 2012, at 1:48 PM, Mike Jarvis wrote:

RandomUnitCircle?

I do allow a radius, so "unit" is probably not accurate. I'm leaning toward RandomCircle so far, but still open to other options.

— Reply to this email directly or view it on GitHub.

Rachel Mandelbaum rmandelb@andrew.cmu.edu http://www.andrew.cmu.edu/~rmandelb/

rmjarvis commented 12 years ago

But, the longer guide also needs to get done (at least in SOME form) in the near term, so if Mike is willing to write the config-related bits, that would be super helpful.

I've started writing some of this on the wiki. Do we want this to also show up in the repo? I'm thinking the repo documentation file could just point to the wiki page. I guess it depends on how much we care about backwards compatibility of the documentation. ie. The wiki always documents the latest version, so might not be completely applicable to the version a particular user has. Not something we need to worry about now though.

barnabytprowe commented 12 years ago

+1 for RandomCircle then... I was thinking "oh you can just multiply it by a real scalar to change the radius" but that's less useful from a config standpoint.

I've started writing some of this on the wiki. Do we want this to also show up in the repo? I'm thinking the repo documentation file could just point to the wiki page. I guess it depends on how much we care about backwards compatibility of the documentation. ie. The wiki always documents the latest version, so might not be completely applicable to the version a particular user has. Not something we need to worry about now though.

I just thought it would be nice to have the documentation in version control too... Then we can cover multiple versions of the code consistently, and we can also use code review to improve the documentation and descriptions at any time. For this reason, I'm starting to wonder if the Wiki has much use as a whole really, except for some of its forum-providing features.

rmjarvis commented 12 years ago

I've completed a first pass of the Config Documentation

rmjarvis commented 12 years ago

3) I didn't quite mean typos, I meant things like comments in frontend.py that certain functionality isn't supported, when it does seem to be supported and used in the MOD*.yaml scripts. If you're still updating functionality then I thought those comments might be in a state of flux. But I can fix those directly if it won't cause problems for you.

When I was writing the documentation, I had some ideas about how to refactor some of the code in frontend.py to make it a bit more maintainable I think, and also make it easier to check if there are any extra unexpected parameters set for various things.

So if you haven't started editing it yet, you should hold off on that. I'll probably do some work on this tomorrow.

rmandelb commented 12 years ago

Yes, I will hold off on editing the comments in frontend.py, no problem. In the worst case, it could wait for a pull request... it was really just 2 cases where we said some functionality wasn't implemented when it actually was.

I'm going to look at your documentation that you put on the wiki in more detail later today, but at a quick glance it looks very thorough! One random point that I noticed and wondered was whether it would be worth clarifying a little more, in the description of the PSFs, that you can set fluxes just as for all the profiles used for galaxies? (which you do mention, but only at the end) As a rule, we want flux=1, but since we allow Add to be used as a PSF (e.g., to make a double Gaussian), I think we do want to state explicitly that if someone uses Add for two of our base classes, they won't automatically get the desired flux=1, unless they either create the original objects with fluxes that sum to 1 or later rescale the flux of the Add.

rmjarvis commented 12 years ago

Good point. I added flux to the psf options with that explanation.

rmjarvis commented 12 years ago

OK, I think I'm done with the config stuff. I've moved things around a bit. The config processing is now in the files:

config/process.py (Top level processing, building images, writing files, etc.)
config/gsobject.py (BuildGSObject and related functions)
config/value.py (ParseValue and related functions)

I also (slightly) changed the way the GSObjects list their parameters from a single _params to _req_params, _opt_params and _single_params. This turned out to be easier to deal with than having the "required", "optional" and "size" specified as strings. I used "single" rather than "size" because the point is that only a single item in the dict may/must be specified. This concept applied to other (non-GSObject) things for parameters other than sizes, so I switched the name to single to be more general.

I then added this style to just about everything that gets built using the config file. If you look through the documentation page I made, you can see how this req/opt/single breakdown pretty much applies to everything there, so it turned out to be pretty convenient.

This also allowed me to have it check for any extra parameters listed in the config file that don't belong, so it will now catch any typos or errant values that the user might make rather than silently ignoring them and using defaults.

So anyway, I think the point of this issue "Mechanisms for more restricted config files" is probably complete. I think the config reader is now completely restricted to only allow valid values at all levels of the hierarchy. (Modulo any bugs I may not have caught yet of course.)

Question, since I kind of shanghai-ed this branch from Jim and Barney:

Is there any reason to keep the files you two had been working on in this effort? I.e. config/definition.py, config/machinery.py, and config/generators.py. As far as I can tell, they are not in a currently working state. There are also tests/test_config_machinery.py and tests/test_config_simple.py which both fail (even after re-enabling those modules, which are currently disabled from the galsim installation). So I don't think they are currently useful. But if they are potentially future-useful, we can save them somewhere in the active repo, rather than merely in the historical record. Any thoughts on that?

My next step will be to try to write a few unit tests myself about the config stuff, which I haven't done yet.

Then I have some ideas about which things should be next to add to the config parser. But I'll make a new issue for doing that when I'm done here.

TallJimbo commented 12 years ago

I don't think there's any need to save the tests in the current repo. But it occurs to me that if we didn't re-branch when Mike starting taking this branch over, we should at least go back, try to find the commit where that started, and tag it. I doubt we'll actually go back to it, but it might be useful. And it should definitely be a part of the historical record of how we ended up where we are.

rmjarvis commented 12 years ago

The commit from which I started editing was 3e3c896d99c7a4613b46d780907ff986d98c9c88. How do you want to tag it?

TallJimbo commented 12 years ago

Probably with just some descriptive name ("#148-pre-yaml"?). I suppose just putting that commit hash in the issue might be enough, too.

barnabytprowe commented 12 years ago

Is there any reason to keep the files you two had been working on in this effort? I.e. config/definition.py, config/machinery.py, and config/generators.py. As far as I can tell, they are not in a currently working state. There are also tests/test_config_machinery.py and tests/test_config_simple.py which both fail (even after re-enabling those modules, which are currently disabled from the galsim installation). So I don't think they are currently useful. But if they are potentially future-useful, we can save them somewhere in the active repo, rather than merely in the historical record. Any thoughts on that?

I'm comfortable with a tag, but so far we do have a fairly clean set of tags (for Milestones 1-3) and I'd be happy just making a record here of the fact that Mike's edits started from commit 3e3c896. We could maybe also mention that in the Pull Request.

Then I'd be happy to delete those files in the repo. And I'm also very happy with the other changes (size-> single etc.), along with all your work Mike. I don't feel Shanghai-ed, just more than faintly inadequate! :)

Speaking of the eventual Pull Request, I should fill you in with a timeline Rachel and I are hoping to aim at having got a few volunteers to look through the code and check documentation etc. for consistency. Clearly this should be merged before they start, as it represents some big changes. We'd like, in an ideal world, to open a Pull Request for this Issue by tomorrow, at which point Rachel and myself would commit to going over it thoroughly but quickly with a view to a merge on Thursday. Combined with a similar timeline for merging the variable shear fields, this would allow us to send out some parts of the python code for people to look at on Friday, before the weekend.

Mike, do you think we might be able to get this into a Pull Request by some point tomorrow? I would venture to suggest that if unit tests remain as a hold-up then we could even add those to a new issue of themselves: this is still a fairly large overhaul for the objects in base.py which will be a primary part of the documentation consistency checking, so it would be good to bring in these changes before we ask people to look at the code, which needs to start soon.

rmandelb commented 12 years ago

I agree with you, Mike, that you've done what was called for here in this issue (aside from the unit tests). Everything else can wait until we've played with it some more and gotten a sense for which improvements are most important, which can be dealt with in a separate issue as you said.

Am I right in thinking that we can close #153 since it is made defunct by this reworking of config?

rmjarvis commented 12 years ago

Am I right in thinking that we can close #153 since it is made defunct by this reworking of config?

Yes. Others in play as closable are :

153 As you mentioned.
195 Obviated by #234
219 May still be a good idea, but not as important now. At least the specific concerns put forward aren't relevant anymore.
220 The concerns raised there aren't relevant anymore.
234 That's merged into this branch, so we can effectively pull request them together.

rmjarvis commented 12 years ago

Mike, do you think we might be able to get this into a Pull Request by some point tomorrow?

I'll see how much unit testing I can finish up by tonight. I'll put a pull request in tonight or tomorrow with whatever I have done by then.

GalSim-developers / GalSim

Mechanisms for more restricted config files #148

153 As you mentioned.

195 Obviated by #234

219 May still be a good idea, but not as important now. At least the specific concerns put forward aren't relevant anymore.

220 The concerns raised there aren't relevant anymore.

234 That's merged into this branch, so we can effectively pull request them together.