GalSim-developers / GalSim

The modular galaxy image simulation toolkit. Documentation:
http://galsim-developers.github.io/GalSim/

Refinements to the real galaxy calculations #168

Closed rmandelb closed 10 years ago

rmandelb commented 12 years ago

Now that we have a way of simulating galaxies from HST training data, I'm opening this issue as a place to collect suggestions for changes to that part of the code. There are a number of obvious things to be done:

Other suggestions are welcome!

rmjarvis commented 12 years ago

Another point to investigate relevant to this is how to keep the pyfits calls reasonably efficient without going crazy on the memory when you have a lot of files to read. For the use case demoed in MultiObjectDemo.py script 3, where 100 objects all come from a single file, calling pyfits.getdata for each one was very slow. I added the preload() method to read the file once and store the hdu list in memory, but this won't always be a viable option.

It would be worth figuring out exactly why pyfits.getdata was slow. e.g. is it just because of disk I/O overheads of opening and closing the file 100 times? Or is pyfits really reading the full file just to get a single HDU out, so it had to do that 100 times? etc. If the latter, then that might suggest we should save the real galaxies each in their own file, so the read command only reads the file we actually want. If the former, then we do want to store a significant number of them in multi-extension fits files and preload at least a file's worth at a time to avoid the repeated file accesses.
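For reference, a rough sketch of the two access patterns being compared is below. This is not the actual GalSim code; it assumes one image HDU per galaxy in a multi-extension file, and uses only standard pyfits calls.

    import pyfits  # astropy.io.fits under newer installs

    # Per-object access: every call re-opens the file and walks through its HDUs.
    def get_stamp_slow(file_name, hdu_index):
        return pyfits.getdata(file_name, ext=hdu_index)

    # Preload-style access: open the file once, keep the HDU list in memory,
    # and index into it for each object afterwards.
    class PreloadedFile(object):
        def __init__(self, file_name):
            self.hdu_list = pyfits.open(file_name)  # headers parsed once, up front
        def get_stamp(self, hdu_index):
            return self.hdu_list[hdu_index].data    # pixel data returned per object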

rmandelb commented 12 years ago

Yes, this is exactly the kind of thing I was hoping to cover with this issue. Another question is even if pyfits is reading in the whole file to get out a single HDU, perhaps there is some other way to get pyfits to not do that, so we wouldn't have to preload OR store everything in separate files. I don't know enough about pyfits to know if this is an option.

This will be especially relevant when we have 100k training galaxies!

rmandelb commented 12 years ago

I'd like to revisit this issue since it's part of the 4th milestone, albeit at slightly lower priority than other items. We should decide which bits to prioritize for now and which could wait. I'm pinging the people who were on the telecon and/or expressed interest in working on the code for a while: @barnabytprowe @TallJimbo @rmjarvis @gbernstein @pmelchior @joezuntz @joergdietrich @rearmstr @dkirkby (with apologies to those who aren't interested in this particular issue).

Going down the list above with an emphasis on what seems important to do before a release of the code within GREAT3 and what tasks seem straightforward vs. which are open-ended:

1) timing for various shears and target PSFs: I think that for the initial release, we don't have to worry too much if the code is a factor of a few slower than we'd like for certain shears and target PSFs when doing calculations with realistic galaxies. We do have to worry about timing issues related to i/o that could be debilitating (see below), but let's treat that issue separately.

2) adding more info to the catalogs if needed: I'd say let's postpone this until we need it!

3) tests of accuracy of shearing as a function of interpolation kernel: this is super important for the challenge or for any use of the code for the basic shear tests. It's also potentially quite difficult to test, and could turn into a complex and open-ended task.

4) decisions related to large training samples: right now we have 100 galaxies in the repo just as a basic test. Ideally I'd like to be able to handle the I<22.5 sample of 26k galaxies when we release this within the GREAT3 collaboration for testing. This will require investigation of time spent on i/o depending on how we store the sample (e.g., separate files vs. one large file) and possibly how we use pyfits to access the data. My feeling is that this is relatively important and also not as open-ended / complicated as (3). It's also a bit boring and technical, but that's life....

5) image padding: the padding of the images by zeros can lead to visually jarring artifacts in the simulated images. While this isn't an issue for accuracy of simulated galaxies, it looks kind of cruddy. So I'd say it's not as important ultimately as (3), but it is moderately important. I could investigate having SBProfile pad with a noise field, and I don't think that would be too hard (hopefully those aren't famous last words). However, the amount of padding itself can lead to accuracy issues depending on the interpolation kernel, so that is more important to check into, even if we end up leaving the padding as zeros.

6) photon-shooting investigations: this could be quick or could be a very long / drawn-out / technical investigation, depending on how things go. The advantage here is that shearing is exact, so we don't have to worry about issues related to accuracy of shearing via interpolation.

Tentatively I'd propose that for this milestone, we try to tackle (4), (5), then time-permitting (3) and (6), and not even attempt to get to (1) and (2) until after the initial release of the code in Sept. Thoughts? Suggestions?

rmjarvis commented 12 years ago

I would say that 4 is the most important to get done soon.

About photon-shooting (6), we've determined that photon shooting gets the noise in the image wrong if the interpolation kernel has negative regions (i.e. Cubic, Quintic, Lanczos). So if we want to use photon shooting with real galaxies, we should either make sure that the sky noise dominates, or use Bilinear interpolation. Or shoot way more photons than you normally need and then add noise back in, but that kind of defeats the purpose.

rearmstr commented 12 years ago

I'd be willing to investigate (4). It seems like a relatively simple task to start getting more familiar with the code.

rmandelb commented 12 years ago

I agree that (4) is relatively simple. How comfortable / familiar are you with pyfits? (We might want to try different ways of having pyfits access the data.)

The only caveat is that to try (4), you might want a larger dataset than the current 100, and you might want that dataset in a few different forms (different HDUs in big files, lots of separate files, etc.). I have all the training data and scripts to put it in various formats. So one possible working model would be for you to request what you need from me for these tests, and I can send it to you. Another option is that you'll have to download the dataset of 26K training galaxies, and I can hand over my scripts that I used to make the catalog of 100 galaxies (IDL, I'm afraid). This dumps more of the work on you since you'd have to learn my scripts, but OTOH it means you're not limited by my ability to promptly send you data that you request. What do you think? What is your time availability like?

rearmstr commented 12 years ago

I've only used the basic functionality in pyfits for reading in images and catalogs.

I think it sounds easier for me to download all the training set and learn to run your scripts. It seems better to be able to generate different formats without having to keep asking you. I should have time next week to devote to this. Where can I get the data?

rmandelb commented 12 years ago

You can download the data from IPAC: http://irsa.ipac.caltech.edu/data/COSMOS/images/shera_galaxy_postage_stamps/index.html

(check out the README, and download the tarball with the postage stamps, but you don't need the tarball with weight maps)

I'm on a telecon so I can't pull my scripts together, but I will try to do that for you tonight. It'll take a while for you to get all that data on your machine, anyway.

rmandelb commented 12 years ago

Sorry, I didn't finish that thought. If you have time for this during this week, then I propose to try to get you everything you need today. In addition to downloading the data and trying the scripts I send you later, you'll need to install GalSim and play with it a bit using the existing training data that is in there already. Perhaps we could aim to have some answers to question (4) by the end of July, since having a way to handle larger training datasets will facilitate a lot of the other work on this issue?

rmandelb commented 12 years ago

... and I just realized that what I stuck into the repo already depends on additional data that wasn't released on IPAC. I used the sextractor segmentation maps to trim the postage stamps so they wouldn't be so huge, relying on SBProfile to pad them after reading them in. So if you want to run my scripts, I'll need to package up that data too.

rearmstr commented 12 years ago

How difficult is it to package the additional data? If getting everything in place for me to run the code is more effort than you generating things yourself, then it doesn't seem to make sense.

I think the end of July sounds like a reasonable time frame to have some answers.

rmandelb commented 12 years ago

Well, packaging the data isn't hard, but I don't have an easy way to send a tarball that is several GB. In practice I have sometimes handed around data like that using temporary accounts on people's computers, but that's not always possible. Any suggestions??

If not, then here's my suggestion: you can pick a number of galaxies you want to use for these tests, assuming you want to start with >100 (currently in the repo) and <26k. It would be trivial for me to rerun my scripts to make data in the same format as is currently in the repo but for a larger number of galaxies. Also, it would only be about 10 minutes of work to set up the script to write to individual files and then test that the change works, and set up the script to rerun that way. I imagine that a set of 1000 galaxies in these two formats would be enough for you to get started on these tests, and since these postage stamps are trimmed, the files would take up only ~150M, which I can easily put on my webspace.

If we decide after that that you need to do significantly fancier things with the galaxy images, then we can revisit this decision to have me generate the data files for you.

joergdietrich commented 12 years ago

A free dropbox account should have more than enough storage to share these files.

joergdietrich commented 12 years ago

I just realized they are a bit smaller than I remembered, but a few referrals among ourselves should add the required space. So, don't create accounts yet if you don't already have one.

rmandelb commented 12 years ago

I already have a dropbox acct but I don't have room for 10 GB in it. And I thought my CMU webspace was limited to less than that, but now I can't find documentation that gives any limit, so I will have to try it today once I get into the office.

rmandelb commented 12 years ago

When I estimated the size of the file, I completely forgot about the fact that segmentation maps compress much better than regular images so they take up much less space. I just stuck them in my dropbox, and sent an invitation to Bob. If you have the images from IPAC and the segmentation maps from my dropbox, you have most of what you need. The rest are just catalogs and scripts, which I will package up and send today/tonight.

rmandelb commented 12 years ago

Bob: some more info for you - I was just packing up my scripts to send to you, plus commenting them a bit more thoroughly, when I realized that they rely on another IDL package: http://www.astro.caltech.edu/~rjm/shapelets/code/ They rely on it only because that code has tools for handling sextractor catalogs, and once I had the shapelets code installed I figured why not rely on it for this. But if you don't have that software, you would need to either install it, or put in your own way for dealing with sextractor.

Anyway, I think that everything you need is now in the dropbox along with a README, but please let me know as questions arise.

barnabytprowe commented 12 years ago

Hi Rachel,

Apologies for coming late to this discussion. I agree with your broad designations:

Short term: (4) & (5)
Medium term (but v important): (3) & (6)
Longer term: (1) & (2)

My own feeling with (5) would be that we should maybe approach this from the other direction: restore the postage stamps to full size as our working set and see how much we can shrink them or what effects we get from replacing their noise fields with our own. I guess I'm a little unsure that I see the logic behind the statement:

5) image padding: the padding of the images by zeros can lead to visually jarring artifacts in the simulated images. While this isn't an issue for accuracy of simulated galaxies, it looks kind of cruddy.

If we can see boxy artifacts, is that not a cause for concern? It seems to me to rely on a kind of detailed balance around an assumed-zero background to be truly safe...

rmandelb commented 12 years ago

Hi Barney,

Apologies for coming late to this discussion. I agree with your broad designations:

Short term: (4) & (5)
Medium term (but v important): (3) & (6)
Longer term: (1) & (2)

My own feeling with (5) would be that we should maybe approach this from the other direction: restore the postage stamps to full size as our working set and see how much we can shrink them or what effects we get from replacing their noise fields with our own.

Yes, this is a fair way to approach that problem.

I guess I'm a little unsure that I see the logic behind the statement:

5) image padding: the padding of the images by zeros can lead to visually jarring artifacts in the simulated images. While this isn't an issue for accuracy of simulated galaxies, it looks kind of cruddy.

If we can see boxy artifacts, is that not a cause for concern? It seems to me to rely on a kind of detailed balance around an assumed-zero background to be truly safe...

Well, I may have been conflating two issues that might be handled differently in general.

1) In the original (large) postage stamps, there are other objects which I've had to mask out. Originally I was subtracting off the background and masking with zeros, but that created artifacts due to the zero-noise regions after PSF convolution. Those artifacts went away when I masked with a correlated noise field.

2) In this case we're doing something slightly different, since we've trimmed off the edges and then in SBProfile the images get padded with zero.

In both cases I think that padding with zero-mean correlated noise implicitly assumes that we've properly subtracted off the background, but in case (1), any deviations from this assumption are less noticeable (since it's just little patches within the image) than in case (2), where the entire outer part of the image is affected. I guess what I'm saying is that now that you mention it, the boxy artifacts might signal an issue with this approach even in case (1), where it looked okay.

rearmstr commented 12 years ago

I have some initial results on using a larger number of training galaxies. I created a sample of 1000 galaxies and split the images into different numbers of files: 10, 40, and 100. I then ran a modified version of RealDemo.py to see how long it took to create a galaxy on average. I also include how long it took using the preload option.

Preload = 0.03 sec/gal
N10 = 0.25 sec/gal
N40 = 0.07 sec/gal
N100 = 0.04 sec/gal

I used the cProfile module to see where most of the time was being spent. I noticed that when getting data from a file, it would cycle through all the headers, potentially to do a checksum. This was with a rather old version of PyFits, v2.4. I upgraded to the latest version, v3.0.8, because the code had been significantly refactored and no longer included this explicit step. Unfortunately, the times were worse with the latest version.

Preload = 0.03 sec/gal
N10 = 0.4 sec/gal
N40 = 0.09 sec/gal
N100 = 0.05 sec/gal

The newer version makes more advanced use of python, but the profiler still shows that a majority of the time is spent reading headers. I looked through the code to see if you could read a file without explicitly reading all the headers, but I didn't see anything. Perhaps we should write such a routine ourselves?
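For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the profiling setup; the file name and HDU count are placeholders rather than the actual test files.

    import cProfile
    import pstats
    import pyfits  # astropy.io.fits under newer installs

    # Read one image HDU per "galaxy" from a multi-extension file, as the
    # per-object access pattern does, and record where the time goes.
    def read_all(file_name, n_hdus=100):
        for i in range(1, n_hdus + 1):   # assumes the stamps live in extensions 1..n_hdus
            pyfits.getdata(file_name, ext=i)

    cProfile.run('read_all("real_galaxy_images_n10_0.fits")', 'read_stats')
    pstats.Stats('read_stats').sort_stats('cumulative').print_stats(15)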

rmandelb commented 12 years ago

Hi Bob,

Let me make sure I understand properly. Your numbers for the "preload" option are if you have all 1000 galaxies in one file, just like the current training sample has all 100 galaxies in one file?

Does your "preload" time include the initial overhead to read in everything, or this is just the number for doing the calculations after everything is read in?

Assuming I've properly understood what your "preload" numbers mean, would it be fair to say that these results suggest we should either:

a) have everything in one big file or a series of big files (e.g., 1k galaxies per file), and use the preload option to generate large numbers of realistic galaxies, or
b) have the samples in lots of little files, potentially one file per galaxy, so that we don't have to cycle through each HDU when reading them in, or
c) write our own routine to read in a particular HDU from a file without cycling through all of them, so that we can use one big file without having to preload?

Of course, (b) and (c) are not mutually exclusive...we could make a compromise, like 100 galaxies per file, with our own routine to read in individual galaxies at a time.

rearmstr commented 12 years ago

Rachel,

Your numbers for the "preload" option are if you have all 1000 galaxies in one file, just like the current training sample has all 100 galaxies in one file?

It doesn't really matter how many galaxies you have per file; if you use preload, the timing is the same. These times do not include the initial overhead of the preload read; for 1000 galaxies it was a few seconds.

It seems like long term we want the (b) and (c) combo. For the next milestone, if we want to include all 26k galaxies the preload time is ~30 sec. Is this acceptable? Should we try to investigate writing a new read routine or put it off till later?

rmandelb commented 12 years ago

I just thought of one more consideration - at some point, preload won't be an option because there won't be enough RAM to have everything sitting in memory. The training sample of 100 galaxies takes up 6M (for galaxy + PSF images). If we have 26k galaxies, that would be 1.6G! My laptop is 4 years old and has 4G of RAM, which I think is sort of typical. So that would kind of work, I guess, but it's getting close to the limit. Eventually we'll have >4x as much training data, so it will no longer be a viable option to preload the entire training dataset.

Furthermore, this is with trimmed postage stamps. If our investigations suggest that we cannot trim them as severely as I've done here, then even for the 26k, preloading won't be an option.

Are there any other considerations that I'm missing?

rearmstr commented 12 years ago

If I understand how it works, it actually doesn't load the data into memory, only the headers. They are fairly small so I don't think it makes a big impact.

rmandelb commented 12 years ago

That wasn't my understanding. I thought it actually sucked everything in, hence Mike's comment in the dox that "There are memory implications to this, so we don't do this by default." @rmjarvis , have I misunderstood?

rmjarvis commented 12 years ago

Preload definitely loads in all the image data. If you look at the getGal function, you can see that if the catalog is marked as having been preloaded, then it doesn't use any pyfits command to return the data. It's all in memory already.

joergdietrich commented 12 years ago

My understanding of pyfits internals is different. Pyfits copies data into memory only when an HDU's .data is accessed. I.e., for preloaded files this only happens in getGal(). According to the pyfits user manual:

''' When a FITS image HDU’s .data is accessed, either the whole data is copied into memory (in cases of NOT using memory mapping or if the data is scaled) or a virtual memory space equivalent to the data size is allocated (in the case of memory mapping of non-scaled data). '''

A quick test loading a large fits file confirms that the resident memory of the python process only increases once .data is accessed; pyfits.open() alone is not enough.

If this is still a problem, we might want to consider opening fits files with memmap=True, which is not the default and should load only those parts into memory that are actually needed. You still need enough virtual memory for the worst case scenario of accessing all data at once.
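A short illustration of this behaviour (the file name is a placeholder):

    import pyfits  # astropy.io.fits under newer installs

    hdu_list = pyfits.open('real_galaxy_images.fits', memmap=True)
    hdr = hdu_list[1].header   # headers are already parsed by open(); cheap to access
    img = hdu_list[1].data     # pixel data are only read / memory-mapped at this point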

rmjarvis commented 12 years ago

OK. You're probably right. I didn't do any tracking of the memory usage through the program or anything. I just assumed that if we were simply accessing a .data member, then it already had to be in memory. But of course, pyfits can use some OO magic to hide a file operation behind the scenes.

rmandelb commented 12 years ago

Okay, so what you're saying is that preloading includes the time-consuming loop over all the HDUs that pyfits.open does, so that getGal or getPSF can simply read in the image without having to do that. And this means that preloading can speed everything up without taking up insane amounts of RAM.

IMO, if we're thinking about making simulations with millions of galaxies, and we have a training sample of ~100k galaxies, then taking 30s to a few minutes to pre-load is a really small price to pay if the timing per galaxy gets decreased by factors of many, which seems to be the case whenever we have a substantial number of training set galaxies grouped together into files. In that case, then when running a simple script to simulate small numbers of galaxies, we would not want to preload, but otherwise we would.

But, I would like to make a well-motivated decision about how to do this for a standard use case, which means we have to decide what is a standard use case. In my head, the case we most worry about is the one I mentioned above, where people want to test shear recovery for realistic galaxies to high precision, requiring a training sample of 100k (presumably grouped into chunks that are written to file together) that would get reused to make millions of simulated galaxies. And then for that use case, we want to make sure that time spent on i/o and the amount of required RAM are minimized. So, if we might have 100k training galaxies, could it be possible that even the small amount of memory required to preload the headers will be too much? How much space does a single one of those headers take, so we can figure out whether the training sample we'll have in a few months is too much to preload? I would like to think that far ahead.

Or, is there another standard use case that could be problematic and might require different considerations?

barnabytprowe commented 12 years ago

A very informative and interesting discussion... I hadn't realised that Pyfits could be made to work in this 'preload' fashion, and it seems to suit our needs rather well.

So, if we might have 100k training galaxies, could it be possible that even the small amount of memory required to preload the headers will be too much?

So, FITS headers are always "a multiple of 2880 bytes (equivalent to 36 80-byte keywords) long." (from the FITS primer http://fits.gsfc.nasa.gov/fits_primer.html). Let's assume a fairly big header with 144 keywords, for sake of argument. That's ~12 kB ~ 1e4 B. Multiplying that by 1e5 for a full set of galaxies gets us to a GB.

So, it will be close to our comfort zone on a laptop. But hardly out of the question. We can also probably allow users to optionally prune away much of this unwanted header information - 144 keywords is a lot.

rmandelb commented 12 years ago

Right now the headers are quite minimalistic, i.e. just like this:

XTENSION= 'IMAGE   '           /Image Extension created by MWRFITS v1.4a        
BITPIX  =                  -32 /                                                
NAXIS   =                    2 /                                                
NAXIS1  =                   47 /                                                
NAXIS2  =                   69 /                                                
PCOUNT  =                    0 /                                                
GCOUNT  =                    1 /                                                
END                                                                             

not 144 keywords, so even with 1e5 training galaxies we'd be okay to preload (and, duh, I could have checked this earlier but apparently I wasn't using my brain this morning... sorry).

But this actually raises another point about what information we want to carry around where. For example, I ditched the WCS that came with the original COSMOS images, figuring that it's just a training galaxy and we don't need to carry around all the information about where it lives. But, in the catalogs, I have saved the COSMOS ID so that if someone does want to know more about the training galaxy, they can look it up. At some point if we have fits from Claire, we will want to save the parameters of those, so people can easily reconstruct a model fit for the galaxy. Right now the catalog is not large... it's 17K for the 100 galaxies in the repo, which scales up to a manageable size even with 1000x as many galaxies. But if we decide we need tons of other data, it could get a little crazy, so we should be careful in considering what info must be saved.

barnabytprowe commented 12 years ago

Yes. It will be better for us as users too if we can force ourselves to be somewhat economical with what info we carry around. Long, full catalogues are great but I find you start to get diminishing returns once you begin forgetting what information they contain :)

Just a quick caveat: even if a header uses fewer than a full block's worth of 36 keywords, the FITS format nonetheless allocates whole 2880-byte blocks and nothing smaller, so the remaining space is simply padded. This means, at best, we still have ~300MB of header being carried around for 1e5 galaxies, even if each header were only 7 keywords long.
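To make the block arithmetic explicit, a back-of-the-envelope sketch (not GalSim code):

    import math

    # A FITS header occupies whole 2880-byte blocks (36 cards of 80 bytes each),
    # no matter how few keywords it actually contains.
    def header_bytes(n_keywords):
        return int(math.ceil(n_keywords / 36.0)) * 2880

    header_bytes(7)                 # 2880 bytes: one block, mostly padding
    header_bytes(144)               # 11520 bytes: four blocks
    1e5 * header_bytes(7) / 1e6     # ~288 MB of header for 1e5 minimal headers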

rmandelb commented 12 years ago

Well, perhaps this suggests we should keep some info that would otherwise go into the catalog in the header instead, up to 2880 bytes, once we have that much space allocated per header anyway? Though I guess having the critical info all in the catalog (and hopefully not too much excess!) is better than some in the header and some in the catalog.

barnabytprowe commented 12 years ago

As long as we can try to have a clear, relatively easily-remembered divide between what info is in the headers and what will come from external catalogues I'd be happy.

One way to do that would be to put a selection of the COSMOS info we think is most relevant into the header, up to 36 keywords. This is sort of the background info - the fits from Claire would then be "derived" info in a separate catalogue.

Another option would be to make our header as "GalSim specific" as possible, storing all the information (sky background, noise info, PSF info, Claire's fitting info) required to allow us make a pretty good simulation of the postage stamp in question. This also captures a lot of the properties of the postage stamp, but not all that we would necessarily want from COSMOS (e.g. RA, DEC etc.)

Finally, we have the option of putting any amount of information into a FITS binary table extension, although this may then need its own overhead for I/O and while neat might be just more of a pain than having a good catalogue.

rmandelb commented 12 years ago

Let me summarize this discussion:

a) We're going to stick with preloading, since it (i) is really fast and (ii) doesn't require us to carry around all the image data in memory, just the headers.

b) In principle, we still don't want to have just 1 huge file with images, because they would be very large, and people who want to just access a few objects (without the ~1min overhead of preloading) would then be limited by the need to find one image out of many HDUs. Plus, enormous files are just unwieldy for people to play with and pass around. We could perhaps imagine storing the images in groups of 1k, which means our current training sample would have 26 files and the eventual full one, ~100. This is very much up for discussion IMO, so other opinions are welcome.

c) I propose that we should have the FITS catalog have all the basic COSMOS information, and the FITS headers for the images will have GalSim-specific info, like galaxy model fit parameters.

d) We need a place to store these files. We're not putting 26k galaxy images into the repo! Given the current 7M for 100 galaxies, we would have 1.8G for 26k galaxies. Barney suggested that we open a GalSim-related dropbox which is fully public (at least for reading :) and tell people to get the data from there. This seems like a good solution to me, at least for now considering that we don't have the GREAT3 server set up, and probably won't for a few months. We also have some $ that could be used to maintain a non-free Dropbox site with more space if we expand the training dataset substantially (but at that point we might want to rely on people downloading from our server, which will be in place by that time).

Implementing the above (a)-(d) would take care of the most pressing order of business related to real galaxies. Well, we might also want to do something additional in the dox, like specifically recommending that people preload when they are simulating more than N real galaxy images, for some N. But that's minor.

However, we have other action items in this issue (5 of them!), some of which might involve changing the postage stamps themselves (for example, the current ones are trimmed pretty severely to contain no blank space). And we will be getting galaxy model fit parameters which we'll put into the headers. So the question is, will any of those changes take place soon enough that it's not worth our doing (a)-(d) now and getting all that data into the repo for our first release of GalSim within the GREAT3 collaboration next month, but instead waiting until we do the other action items?

I'm busy with variable shear fields, and will be for the foreseeable future. @rearmstr , my understanding was that you were available mainly in July, and we're well into August. Do you see yourself wanting to address any of the other "real galaxy" action items, or are you done at this point? (which I would completely understand...) If you don't have the time/inclination to address any other action items in this issue, then unless someone else volunteers, I propose that we should do (a)-(d) now, and take a short break on this issue. And the question for you, Bob, is: do you have time to do (a)-(d) now that you already have scripts set up to make a bunch of files with training set galaxies? If not, I completely understand and will do it myself, but if you do, then it would be helpful if you could do that. In either case, I really appreciate the tests that you did, as I really hadn't realized just how ideal this preloading option was in terms of speed and memory usage! Some very useful lessons about pyfits here...

rearmstr commented 12 years ago

I'd be happy to do (a)-(d).

We could perhaps imagine storing the images in groups of 1k, which means our current training sample would have 26 files and the eventual full one, ~100. This is very much up for discussion IMO, so other opinions are welcome.

c) I propose that we should have the FITS catalog have all the basic COSMOS information, and the FITS headers for the images will have GalSim-specific info, like galaxy model fit parameters.

Do we know which information we want to keep in the headers? Is all the information we need in the set of files that you pointed me to, or do we need other external data?

Well, we might also want to do something additional in the dox, like specifically recommending that people preload when they are simulating more than N real galaxy images, for some N. But that's minor.

I'm not sure there is a case for not preloading if we have ~1K galaxies per file. In that case each time you want to simulate one galaxy you will have to read all 1K headers.

I'd also like to address some of the other remaining issues here. Based on the discussion above, it seems like issue (5) is the most pressing, followed by (3) and (6). It seems reasonable to at least do (5) before the next milestone and perhaps begin to investigate the remaining two.

For (5) there seem to be a couple of different directions already discussed: 1) modify SBProfile to pad with noise instead of zeros, and 2) keep more of the original postage stamp. Are there other avenues I should explore?

rmandelb commented 12 years ago

Cool, thanks. For the headers, I think we currently don't have anything. The main thing would be galaxy model fit parameters, which are 1/2 done but not complete, and probably won't be complete for the next ~2 months due to the fact that the person running the code is moving internationally. So for now, we'll just have the FITS catalog with the basic info.

The case for not preloading: here was my calculation - If we want to simulate just a few galaxies (let's say M) as a test case, but we preload, then the time it takes with 26k training galaxies is 30 (preload) + 0.03M seconds

If we don't preload, then the time taken is the time per galaxy to read in the 1k headers in whichever training sample file the galaxy lives in, i.e. (~1s)M

So for tests with <30 galaxies, preloading doesn't make sense in this scenario with 1k galaxies per file. But 1k is a number I pulled out of nowhere (based on vague considerations- that it's too annoying to have a file per galaxy since that's tons of files, but extremely huge files are unwieldy to pass around) so if we want to consider other choices of galaxies per file then we might revisit this.
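To put that break-even point in code form (a sketch using the timings quoted above and assuming 1k galaxies per file):

    # Preload: pay ~30 sec up front for the 26k sample, then ~0.03 sec per galaxy.
    # No preload: ~1 sec per galaxy to open a file and walk through its ~1k headers.
    def preload_is_faster(n_gal, t_preload=30.0, t_gal_pre=0.03, t_gal_direct=1.0):
        return t_preload + t_gal_pre * n_gal < t_gal_direct * n_gal

    preload_is_faster(20)   # False: for small test runs, skip the preload
    preload_is_faster(50)   # True: for anything bigger, preloading wins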

As for the other issues: I do agree with you about the priority. For (5), those are the first things I would consider. There is one problem: we don't have a way of generating a correlated noise field (yet). We will need this functionality, so the question is whether it is worth developing now. For Shera, I sort of cheated and used bits of sky from COSMOS data, but this isn't a great solution, and it seems that instead of putting in a kludge like that and then revamping it later with truly random correlated noise fields once we have them, it would be better to do this properly from the start.

rearmstr commented 12 years ago

I think it makes sense to do things properly from the start; it seems to save time in the end. I don't have any experience with generating correlated noise fields. How big of an effort do you think this would be?

rmandelb commented 12 years ago

Hi Bob - I think it could be done quite simply. Barney already has some tools to do that in python (slow!) though they aren't in GalSim... it's just linear algebra, and we've already included TMV so it should be possible to do this quickly in the C++ layer.

To be clear, I think these tools could easily be used to pad with correlated noise, but the noise would not be correlated across the boundary. Gluing in a correlated noise field with proper correlations to the existing noise field in the part of the postage stamp that has data would be harder, and probably not necessary.
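For reference, the linear algebra in question is essentially the following; this is a generic sketch, not Barney's routines or the eventual C++/TMV implementation. For image padding, the pixels of the padded region would be flattened into the vector being generated.

    import numpy as np

    def correlated_noise(cov, rng=np.random):
        # Generate a noise vector whose covariance matrix is `cov`
        # (assumed symmetric positive definite) via its Cholesky factor.
        L = np.linalg.cholesky(cov)               # cov = L L^T
        white = rng.standard_normal(cov.shape[0])
        return np.dot(L, white)                   # covariance is L I L^T = cov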

rearmstr commented 12 years ago

Good. It sounds like the way to proceed is to write the correlated noise routine in C++ (using TMV) and allow SBProfile to use this to pad the images.

Can you point me to Barney's python routines that do this?

rmandelb commented 12 years ago

@barnabytprowe , can I put those correlated noise routines from python into the repo, or at least send them directly to Bob?

Bob, Barney is away now so he might not be able to reply quickly. If he can't, then I will simply write a summary of what matrix manipulations are needed.

And... sorry to keep doing this, but I realize one thing I forgot: we need to decide on a noise covariance matrix. This means that we need to decide on one for each of the HST training galaxies, which don't have exactly the same covariance matrix. So, a few thoughts:

1) Do we have to make sure we get the per-galaxy noise covariance matrix? (They won't be quite the same, because there are several causes of correlated noise, some isotropic and some anisotropic, some depending on the point in the focal plane, etc. Of course most will be quite similar.)

2) If we have postage stamps with lots of noise around the galaxies, we can estimate it from the pixel values by masking out the galaxy (which I think Barney also has some code to do), but the covariance matrices will be quite large and, of course, with a finite correlation length, will have many values that are just zero. So what is the best way to carry this info around? We could imagine running a routine to get the full noise covariance for each real galaxy, then parameterizing it in terms of just a few numbers (e.g.: diagonal covariance, and then a few values of the correlation for various distances, with correlations assumed to be zero for larger distances). Those few numbers could get saved in the image header, and then we could have a routine that reconstructs the full covariance matrix from them for the generation of padded noise.

3) The operation in (2) sounds like overkill when it comes to padding with correlated noise that exactly matches the noise field near each galaxy. But when we want to carry out noise whitening on the existing noise in the COSMOS galaxy images, which will be important for including fainter training galaxies, I believe we will need that information anyway (we have to take the original noise field, work out what happens to it when we shear and convolve to the target PSF, and then whiten the resulting noise - so this requires carrying around the original noise field!).

Now, we could imagine doing something simpler for the purpose of padding with correlated noise - e.g., using some average level of noise correlations for the training set galaxies, so only the noise variance is allowed to vary per-galaxy - but, as in (3), we'll eventually want to do the job right. While we didn't plan to include the noise whitening and fainter training galaxies until later this fall, doing some of the preliminary work now (time permitting!) would help make the later work go faster.

barnabytprowe commented 12 years ago

Hi,

Certainly, please put the correlated noise routines into the repo - I have no idea how clear they are though. This is not a super trivial thing, and I do not think there is a single, well-understood "right" way to do things from the beginning. I would suggest you just implement the first run using a diagonal covariance matrix: that will already be a lot to get completed. We can get to generating the non-diagonal modes later. We agreed on the telecon to do this issue in an as-possible fashion, and I really don't think that I can commit time to it personally before next month. I also think that it would be inefficient for someone else to have to redo/re-think some of the ideas/pitfalls I've already worked through on this problem, so I'd urge you to find other areas of work if possible!

(One thing: you almost certainly want to store a correlation function, rather than a covariance matrix, as a representation of the noise covariance. The covariance values can be read off from it if you assume translational symmetry, which we surely will across a postage stamp.)

rearmstr commented 12 years ago

Ok, given your response, Barney, I will try to implement just a diagonal covariance matrix for each galaxy. I assume the best way to calculate this is to mask out all the objects in each image using the segmentation map and find the variance of the remaining pixels.
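Something like the following sketch is what I have in mind; treating `segmap == 0` as "not part of any object" is an assumption about the segmentation map convention.

    import numpy as np

    def sky_variance(image, segmap):
        # Keep only pixels that the segmentation map does not assign to any object,
        # then take their variance as the (diagonal) noise covariance.
        sky_pixels = image[segmap == 0]
        return np.var(sky_pixels)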

I also have a question about the best way to add this noise to the zero-padded region. Rachel suggested doing it in SBProfile itself. If I understand how the different draw routines work there, the code determines how large an image is needed and sets all the pixel values to zero. It then adds the values from the function itself to the pixels. It seems that the noise should be added after this point by checking if any remaining pixel values are zero. To do it in SBProfile itself we would need to pass a covariance matrix to the draw command. Is this a good way to approach this?

rmandelb commented 12 years ago

Well, it depends how far ahead we want to look when we do this. If you only want to add diagonal covariances, then you could define a keyword for the draw or drawShoot commands, something like pad_variance, which specifies the variance to use when padding the images with Gaussian noise.

Once we have to add correlated noise, then we have to start passing matrices between C++ and python, so that might be best done using Images. In that case, it could be pad_covariance -- which, if it's just a number, would be taken as a single variance for uncorrelated noise, and if it's a numpy array or Image, would be taken to be the covariance. Still, that extension to correlated noise is probably best done later, given Barney's message.

rearmstr commented 11 years ago

Sorry, I've been pulled away the last couple of weeks. Given the tighter time constraints, it would seem your first option is a better near-term solution, and I don't have any experience passing things between C++ and python. I guess my question is: how much do you think this will help in solving the original problem?

rmandelb commented 11 years ago

Based on my past experience: when we look at the images padded with uncorrelated noise, it will be visually quite obvious that the noise properties are different in that region. But when we look at the resulting simulated images after shearing and convolving with a ground-based PSF, the difference won't be as obvious. It will be a definite improvement over the case where the padding is with zero. So, it is up to you as to whether you have the time and inclination to do this.

If you don't want to deal with learning how to pass things between C++ and python, then if you are up for doing the job in the C++ parts of SBProfile, I could do the rest. If you do want to learn this, I can point you to relevant examples in the code that would show you how to do this.

Looking at the big picture here: you had previously seemed interested in doing steps (a)-(d) in my comment above from 20 days ago. Are you still interested in that? And in the context of the planned release of GalSim within the GREAT3 collaboration, here are my thoughts:

Hopefully @barnabytprowe will chime in here, but I think he and I are in agreement about these priorities.

rearmstr commented 11 years ago

Looking at the big picture here: you had previously seemed interested in doing steps (a)-(d) in my comment above from 20 days ago. Are you still interested in that?

Yes, those are definitely in my plans. I have generated the 26k set and am planning on putting it on Dropbox as Barney suggested.

But when we look at the resulting simulated images after shearing and convolving with a ground-based PSF, the difference won't be as obvious. It will be a definite improvement over the case where the padding is with zero. So, it is up to you as to whether you have the time and inclination to do this.

I think there is still something I don't understand about how it would be used in SBProfile. Just adding noise in the "draw" method in SBProfile doesn't seem like it would accomplish what you want, since it determines the needed size of the image, fills it with zeros, and then draws into it. For an interpolated image, like a real galaxy, it would give zero outside the bounds of the initial image. Does this mean the noise needs to be added to SBInterpolate? Am I understanding things correctly?

rmandelb commented 11 years ago

Thanks for clarifying your plans about the 26k set of training galaxies.

Yeah, I'm sorry, it wouldn't be as a keyword to draw and drawShoot (that was from a different part of our discussion). It would be in the initialization of the SBInterpolatedImage, where the image is padded by some factor. Usually that's done by making a new image which is all zeros, and putting the training set galaxy image into the center of it. So we would have to set a keyword for the constructor of SBInterpolatedImage, something like pad_variance=0.0 (default value 0), and when pad_variance is nonzero, then SBInterpolatedImage would determine which pixels are the padding and then populate them with noise with the desired variance. In practice the quickest way to do this might be to actually populate all the pixels with noise using one of our Image methods, then zero out the pixels where the smaller Image we have supplied will be located, and then add in our image.

Does that make sense to you?
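In pseudo-python, the recipe would be roughly the following; pad_variance is the proposed keyword rather than an existing argument, and the array handling is schematic (the real version would live in the SBInterpolatedImage constructor).

    import numpy as np

    def pad_with_noise(stamp, pad_factor, pad_variance):
        ny, nx = stamp.shape
        py, px = ny * pad_factor, nx * pad_factor
        # Fill the whole padded array with Gaussian noise of the requested variance...
        padded = np.random.normal(0.0, np.sqrt(pad_variance), size=(py, px))
        # ...then overwrite the central region with the original postage stamp
        # (equivalent to zeroing those pixels and adding the stamp in).
        y0, x0 = (py - ny) // 2, (px - nx) // 2
        padded[y0:y0 + ny, x0:x0 + nx] = stamp
        return padded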

rearmstr commented 11 years ago

Yes, that makes sense. The changes that would need to happen now seem more straightforward. I should be able to implement them by the end of next week.

I noticed examples in the pysrc directory where images were being passed to C++, which would cover the future case of passing in a covariance matrix. Since generating the correlated noise is on hold, I think I will just focus on getting things to work for the diagonal case.

rmjarvis commented 11 years ago

For issue #277, it became clear that it's a bit hard to use the original image and psf stored in a RealGalaxy if you want to do something with them directly (rather than combined into the deconvolved profile). e.g. the RealDemo.py script draws them and has to put in a bunch of boilerplate around the draw command to get the image ready. Stuff that the GSObject draw command does for you.

So my proposal (which Rachel suggested belongs on this branch) was to make methods for RealGalaxy that return the original image and original PSF as GSObjects rather than using the internal attributes, which are SBProfiles. So

def getOriginalImage(self):
    return galsim.GSObject(self.original_image)

However, it occurred to me afterwards that we probably don't even need to store them as SBProfiles. We can just have self.original_image and self.original_PSF be GSObjects directly (constructed as I did above). Then we can document their existence in the doc string for RealGalaxy (rather than implicitly through RealDemo.py), and users would be able to use them the way they normally do with GalSim objects.
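Usage would then look something like this (illustrative only; draw arguments such as the pixel scale are omitted, and real_galaxy_catalog stands for any RealGalaxyCatalog instance):

    real_gal = galsim.RealGalaxy(real_galaxy_catalog, index=0)
    orig_im = real_gal.original_image.draw()     # the original HST postage stamp
    orig_psf_im = real_gal.original_PSF.draw()   # the original PSF image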