LSSTDESC / SSim_DC1

Configuration, production, validation specifications and tools for the DC1 Data Set.

script to convert coadd catalogs to dask dataframes using the data butler #45

Open jchiang87 opened 6 years ago

jchiang87 commented 6 years ago

This is code to address #44. Opening this PR so that comments can be made on the code.

jchiang87 commented 6 years ago

@cwwalter This code creates one chunk per patch, but your script had larger chunks (1M entries per chunk). Is it worth combining patches to have larger chunks? Or might it be desirable to have one chunk per patch (and label them that way)?

cwwalter commented 6 years ago

I think having one chunk per patch is probably a good idea. That way someone could grab a single patch as a normal pandas dataframe if they wanted to work with just that patch of image data.

I think the only downside is that if you want to grab just one normal pandas dataframe for tests, you might not have enough statistics for it to be useful. How many entries are there typically per patch, compared to the batched-up 1M-entry chunks I was making before? I don't think (though I don't really know) that having many more keys would negatively impact performance.

You are naming the keys by the patches, correct? I put the type of file (dithered, etc.) in the key name with the idea that if the file got renamed you could still see what it was by looking inside it. But there might be better ways to do that.

jchiang87 commented 6 years ago

There appear to be ~60k-80k objects per patch (at least for imsim-undithered). If we chunk these by patch for now, it's probably easy enough to aggregate them later if desired. I agree it would be safer to put the simulation type (imsim-dithered etc) in the key name in case of file renaming, though there may be alternatives, e.g., associating metadata somehow.
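
A minimal sketch of the per-patch layout under discussion, with a hypothetical get_patch_dataframe helper standing in for the actual butler access in this PR:

# Sketch only (not the PR code): one HDF5 key per patch, with the
# simulation type encoded in the key name so the file is self-describing
# even if it gets renamed.
import pandas as pd

sim_type = 'DC1_imsim_dithered'
patches = ['12,7', '13,7', '14,7']    # hypothetical patch list

with pd.HDFStore('coadd-DC1-imsim-dithered.hdf') as store:
    for patch in patches:
        df = get_patch_dataframe(patch)    # hypothetical helper wrapping the butler
        key = '{}_{}'.format(sim_type, patch.replace(',', '_'))
        store.put(key, df, format='table')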

cwwalter commented 6 years ago

OK, this looks good. You are iterating over the fields/keys in the schema, so I guess this will get everything, including the flags, etc. Is this what you did to get what was loaded into the SQL database?

Just checking that there isn't anything extra to do to get all the entries we had before.
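
For concreteness, iterating over the schema might look roughly like the sketch below; the deepCoadd_meas dataset type, the dataId, and the deep copy (which makes the catalog contiguous so column access works) are assumptions about the Gen2 stack API of that era, not a quote of the PR code:

# Hedged sketch: pull every field in the schema into a pandas dataframe.
import pandas as pd

catalog = butler.get('deepCoadd_meas', dataId=dataId).copy(deep=True)
df = pd.DataFrame({name: catalog[name] for name in catalog.schema.getNames()})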

jchiang87 commented 6 years ago

Yes, this is what I did for loading the MySQL tables... it seemed to work as desired. I have an example file here:

/global/u1/j/jchiang8/DC1_work/SSim_DC1/scripts/coadd-DC1-imsim-dithered.hdf

with three patches,

$ h5ls coadd-DC1-imsim-dithered.hdf 
DC1_imsim_dithered_12_7  Group
DC1_imsim_dithered_13_7  Group
DC1_imsim_dithered_14_7  Group
$ 

in case you want to have a look.
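
Reading the file back in might look like this sketch; the dask form assumes dask's read_hdf wildcard-key support:

import pandas as pd
import dask.dataframe as dd

# A single patch as an ordinary pandas dataframe:
patch_df = pd.read_hdf('coadd-DC1-imsim-dithered.hdf', 'DC1_imsim_dithered_12_7')

# All patches as one dask dataframe, one partition per patch
# (dask's read_hdf accepts a wildcard in the key):
coadd_df = dd.read_hdf('coadd-DC1-imsim-dithered.hdf', key='/DC1_imsim_dithered_*')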

cwwalter commented 6 years ago

Hmm... it's the typical "I don't have permission" problem when trying to look in:

/global/u1/j/jchiang8/DC1_work/SSim_DC1/scripts/

BTW, I also used my script before to make the star and galaxy files. Since those were also in the database, that worked. But I guess the way you made those databases was different, so we can't easily adapt this script to do that as well, is that right?

jchiang87 commented 6 years ago

Ah, sorry. I changed the group for the DC1_work/SSim_DC1 directory tree to lsst. Can you try again?

For the star and galaxy files: yes we'll need a different function to read in the instance catalogs.

cwwalter commented 6 years ago

OK, I played with it and it looks good. Doing the test reminded me of one last thing that might be missing:

I was looking at the code I used before to handle the flags, etc. manually. I calculated and added psfMag and cmodelMag to the dataframe (since everyone needs these for almost everything). I used the zero point you gave me:

flux_mag0 = 63095734448.0194  # Zero point for magnitudes

but the correct way to do it is to use the calibration object, as at the end of the 'Butler access demo' notebook from Simon.
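
To make the two options concrete, here is a hedged sketch; the base_PsfFlux_flux column name and the Gen2 getCalib API are assumptions based on the stack of that era, not the PR code:

import numpy as np

# (a) Hardcoded zero point, as in the old script:
flux_mag0 = 63095734448.0194    # zero point for magnitudes
df['psfMag'] = -2.5 * np.log10(df['base_PsfFlux_flux'] / flux_mag0)

# (b) Using the calibration object, per the 'Butler access demo' notebook
#     (butler and dataId assumed to be defined):
calexp = butler.get('deepCoadd_calexp', dataId=dataId)
calib = calexp.getCalib()
calib.setThrowOnNegativeFlux(False)    # return NaN rather than raise on negative fluxes
df['psfMag'] = calib.getMagnitude(df['base_PsfFlux_flux'].values)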

I think we should probably add those to the data frame on creation so it is done properly. What do you think?

Also, maybe this deserves its own issue, but reviewing the code I wrote reminded me that currently in the galaxy table (which comes from the SQL database) each galaxy component has its own entry. So the bulge+disk galaxies need to be combined to give a combined magnitude, along with a type indicating what sort of galaxy we are looking at.

So basically we should make one line per galaxy. I think @danielsf said there might be something that could do this already.
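
A hedged sketch of the kind of aggregation meant here; all column names (galaxy_id, component, magnitude) are hypothetical:

import numpy as np

def one_row_per_galaxy(df):
    # Convert component magnitudes to fluxes, sum per galaxy, convert back.
    df = df.assign(flux=10.0 ** (-0.4 * df['magnitude']))
    grouped = df.groupby('galaxy_id').agg(
        flux=('flux', 'sum'),
        galaxy_type=('component', lambda c: '+'.join(sorted(set(c)))),  # e.g. 'bulge+disk'
    )
    grouped['magnitude'] = -2.5 * np.log10(grouped['flux'])
    return grouped.reset_index()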

Also @kadrlica might comment: For the stars: do you have the stellar population information you need now for the DM studies?

jchiang87 commented 6 years ago

So I don't lose track of it, I added the script I used to compute the ugrizy apparent magnitudes for the objects in the instance catalogs that were input to imsim.py for DC1. We would just need to modify it to have the option to produce dataframes instead of csv files.
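
That option could be as simple as the hedged sketch below (function and argument names hypothetical):

def write_output(df, outfile, hdf=False):
    # Write either an HDF5 dataframe or the original csv format.
    if hdf:
        df.to_hdf(outfile, key='apparent_mags', format='table')
    else:
        df.to_csv(outfile, index=False)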

jchiang87 commented 6 years ago

> I think we should probably add those to the data frame on creation so it is done properly. What do you think?

@cwwalter Sure, that sounds fine. Just so there is no ambiguity about what to do, can you provide a pointer to where you implemented the psfMag and cmodelMag calculations...or just add it to the make_coadd_dataframes.py script?

cwwalter commented 6 years ago

@jchiang87

> Just so there is no ambiguity about what to do, can you provide a pointer to where you implemented the psfMag and cmodelMag calculations.

Sorry I didn't get to this yesterday.

The code I used is in:

https://github.com/cwwalter/DC1_Analysis/blob/master/Scripts/Process-DataFrame.py#L59

but this is just using the hardcoded zero point you gave me.

I think the best way to do this is in:

https://github.com/LSSTDESC/SSim_DC1/blob/master/Notebooks/Butler%20access%20demo.ipynb

(from Simon)

I looked at the code in the commit above, and it looks like this is what you did (minus restricting the entries on values and flags). This code looks good; I agree this is the way to do it.

cwwalter commented 6 years ago

> So I don't lose track of it, I added the script I used to compute the ugrizy apparent magnitudes for the objects in the instance catalogs that were input to imsim.py for DC1. We would just need to modify it to have the option to produce dataframes instead of csv files.

Great. Thanks.

The other change is that for galaxies we need to return a combined magnitude and also encode the Sersic components of the constituent parts.

kadrlica commented 6 years ago

@cwwalter I think for right now we are happy with the stellar templates available. Once we dig in deeper we may have more detailed requests, but we aren't at that stage yet.

cwwalter commented 6 years ago

> I think for right now we are happy with the stellar templates available. Once we dig in deeper we may have more detailed requests, but we aren't at that stage yet.

OK, thanks.