STScI-Citizen-Science / MTPipeline

Pipeline to produce CR-rejected, AstroDrizzled PNGs of HST WFPC2 solar system data.

Switch to MAST naming conventions #143

Closed ktfhale closed 10 years ago

ktfhale commented 10 years ago

We need to change the naming conventions of our files to match those desired by MAST, as specified here. We'll need to make some kind of choice regarding how to serve both the log and linear png images, but otherwise it should be fairly clear.

We'll need to rename existing files, and to change our pipeline's filename handling to match the new convention.

ktfhale commented 10 years ago

Here's what I think the new filename progression should look like. That is, here are the files for and output from each step of the pipeline. Astrodrizzle spits out all sorts of stuff that I don't think we care about, like dqmask.fits files and sci<1,2,3,4>.fits files for each extension. To my knowledge, those are useless to us, and I won't mention them here. I'll only mention the initial inputs, the final outputs, and whatever files are essential to go through. Well, I'll include the AstroDrizzle _wht.fits files, because we once did use those, and they are probably useful to scientists.

cr-rejection input: asdfghjkl_flt.fits or asdfghjkl_c0m.fits asdfghjkl_c1m.fits

cr-rejection output: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_flt.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c0m.fits hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c1m.fits

AstroDrizzle will still see c0m, c1m, and flt, and it will presumably stay happy. I don't think we can give the pipeline version number as v1.0, as I remember Scott not wanting .'s in the names, which makes sense.
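The naming scheme above can be sketched as a small helper. This is only an illustration of the convention as proposed in this comment; `mast_name` and its parameters are hypothetical names, not functions from the pipeline, and the field order is read off the example filenames.

```python
def mast_name(rootname, instrument, target, filt, suffix,
              version="v1-0", ext="fits"):
    """Join the filename fields with underscores, as in the examples above.

    Hypothetical helper: not part of the pipeline code.
    """
    fields = ["hlsp", "mt", "hst", instrument, rootname,
              target, filt, version, suffix]
    return "_".join(fields) + "." + ext


# The version is written v1-0 rather than v1.0 to keep periods out of the name.
print(mast_name("asdfghjkl", "wfpc2", "ceres", "F606W", "c0m"))
print(mast_name("asdfghjkl", "wfc3-uvis", "ceres", "F606W", "flt"))
```

Because the original suffix (c0m, c1m, flt) stays at the end of the name, AstroDrizzle should still be able to recognize the file type.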

I'll also differentiate between WFPC2 and the other instruments, as they result in distinct outputs.

AstroDrizzle inputs: not cr-rejected: asdfghjkl_flt.fits or asdfghjkl_c0m.fits asdfghjkl_c1m.fits

cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_flt.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c0m.fits hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c1m.fits

As we do currently in run_astrodrizzle, we'll need code that renames the outputs. Rather than adding _wide_ to _single_, as we do currently, we'll want to just get rid of _single_:

AstroDrizzle outputs: not cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img.fits hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_wht.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img.fits hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_wht.fits

cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci.fits hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_wht.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci.fits hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_wht.fits

You may have noticed a difficulty. There's nothing to differentiate the non-cr-rejected _wht.fits file from the cr-rejected _wht.fits file. Fortunately, they are nearly interchangeable. Their actual data arrays are identical; only a few keywords in the header, which store filename information, are different. I think we can get away with one overwriting the other when AstroDrizzle is run, as long as AstroDrizzle doesn't worry about overwriting files.

PNG inputs: not cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img.fits

cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci.fits or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci.fits

PNG outputs: not cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img-linear.png hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img-log.png or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img-linear.png hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img-log.png

cr-rejected: hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci-linear.png hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci-log.png or hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci-linear.png hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci-log.png

And finally, a summary of all the files that will exist at the end:

If WFPC2:

asdfghjkl_c0m.fits
asdfghjkl_c1m.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c0m.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_c1m.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_wht.fits
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img-linear.png
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img-log.png
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci-linear.png
hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_sci-log.png

If a different instrument:

asdfghjkl_flt.fits
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_flt.fits
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img.fits
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci.fits
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_wht.fits
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img-linear.png
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_img-log.png
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci-linear.png
hlsp_mt_hst_wfc3-uvis_asdfghjkl_ceres_F606W_v1-0_sci-log.png

My first task will be to change the hardcoded test dictionary to match this.

ktfhale commented 10 years ago

I'm working through the filename handling, and I have a question. Why is the original file listed in the cr-rejection outputs? Should we keep that? I'll keep it unless told otherwise.

EDIT: my guess is that it's there to preserve parity with the doubled nature (one output for the cr-rejected version, one output for the non-cr-rejected version) of the AstroDrizzle and PNG outputs.

ktfhale commented 10 years ago

I've redesigned make_output_file_dict() to match the new naming conventions, and I've fixed some things in the tests. All tests pass with the naming conventions specified above.

I'll move on to making sure the individual steps write out the correct filenames. There's a bit of a design choice that I'd like to make.

At the moment, the AstroDrizzle and PNG steps have filename handling in them. It's most significant in the AstroDrizzle step, where we need to rename the outputs from AstroDrizzle to match what we want. We specify what we want in run_astrodrizzle.py. I think all the filename handling should be controlled centrally, from output_file_dict. That is, the filenames specified in output_file_dict should be the names each step uses for its outputs.

This will require adding another level of detail to output_file_dict. The dictionary would need individual keys for every single output file, but I don't think that's too bad. Other than changing the manually specified testing dictionary, this wouldn't cause us to change the testing code itself. This would also simplify any future changes in the filename conventions. All that would have to be changed is make_output_file_dict() and its tests.

ktfhale commented 10 years ago

In my first attempt at make_output_file_dict(), I messed up the handling of the path to the input file. I believe I have corrected the issue, and I've added a set of expected outputs that have a path to the input file, rather than the input file just being at the current directory.
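A minimal sketch of the path handling described here, assuming the output file should land in the same directory as the input. The helper names `split_input` and `output_path` are hypothetical and are not taken from make_output_file_dict(); the name layout follows the convention proposed earlier in this issue.

```python
import os


def split_input(path):
    """Split 'some/dir/asdfghjkl_flt.fits' into (directory, rootname, suffix)."""
    directory, basename = os.path.split(path)
    stem = os.path.splitext(basename)[0]
    rootname, suffix = stem.rsplit("_", 1)
    return directory, rootname, suffix


def output_path(input_path, instrument, target, filt, new_suffix):
    """Build the output name and keep it next to the input file."""
    directory, rootname, _ = split_input(input_path)
    name = "hlsp_mt_hst_{}_{}_{}_{}_v1-0_{}.fits".format(
        instrument, rootname, target, filt, new_suffix)
    # os.path.join with an empty directory returns just the filename, so the
    # same code covers inputs in the current directory and inputs with a path.
    return os.path.join(directory, name)
```

Testing both cases (a bare filename and a filename with a leading path) is what catches the kind of mistake described above.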

ktfhale commented 10 years ago

This last commit was a big one.

First, I made some syntax corrections and changed some variable names so the pipeline can actually run, and so the metadata was getting into the filenames correctly.

But more importantly, I've started basing the names of step outputs off of the output file dictionary.

Actually, run_cosmicx already did this. But I had to make some changes there so that the symlink is made with the correct format.

Renaming the AstroDrizzle outputs was, predictably, the hardest. I decided to simplify my life by straight-up deleting all the AstroDrizzle products we don't want. I delete every output that has 'mask' in its name. To my knowledge, we've never used those. I am still keeping the _wht.fits files.

It's very important to note that, as a result of how I'm using the output dictionary to provide the output filenames, the order in which the files appear in the output file dictionary's various lists is now crucial. Rather than changing the output dictionary so that every file has its own key, I'm enforcing a correspondence between elements in the lists.

That is, if asdfghjkl_flt.fits is the first element in the cr_reject_output list, then hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_img.fits, the corresponding non-cr-rejected AstroDrizzle output, needs to be the first element in drizzle_output.

Likewise, if hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_c0m.fits is the second element in the cr_reject_output list, then hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_sci.fits, the corresponding cr-rejected AstroDrizzle output, needs to be the second element in drizzle_output.
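The positional correspondence can be illustrated with a short sketch. The keys cr_reject_output and drizzle_output are the ones named above; the function `paired_outputs` and the example values are hypothetical and only show how a length check plus `zip` makes the ordering assumption explicit.

```python
def paired_outputs(output_file_dict):
    """Pair each AstroDrizzle input with its output by list position."""
    inputs = output_file_dict["cr_reject_output"]
    outputs = output_file_dict["drizzle_output"]
    if len(inputs) != len(outputs):
        # A length mismatch means the positional correspondence is broken.
        raise ValueError("cr_reject_output and drizzle_output differ in length")
    return list(zip(inputs, outputs))


# Illustrative WFPC2 dictionary (values are examples, not pipeline output).
example = {
    "cr_reject_output": [
        "asdfghjkl_c0m.fits",
        "hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_c0m.fits",
    ],
    "drizzle_output": [
        "hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_img.fits",
        "hlsp_mt_hst_wfpc2_asdfghjkl_mars_F606W_v1-0_sci.fits",
    ],
}

for drizzle_input, drizzle_output in paired_outputs(example):
    print(drizzle_input, "->", drizzle_output)
```

The length check only catches missing entries; a swapped order would still pass silently, which is the trade-off of relying on list order rather than one key per file.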

The pipeline currently runs fine up to png creation.

ktfhale commented 10 years ago

Getting png to output the correct filenames was just a matter of changing an _ to a -. While I was at it, I stopped passing the weight file to run_trim(), since we no longer need it to do saturated_clip(), and I got rid of saturated_clip() for good measure.

The pipeline now produces images with the MAST filenames.

ktfhale commented 10 years ago

I just realized that the filenames above are slightly incorrect. It should be ipsud-target, not ipsud_target. That is, instead of

hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img.fits

it should be

hlsp_mt_hst_wfpc2_asdfghjkl-ceres_F606W_v1-0_img.fits

I'll make the necessary changes to fix this.

ktfhale commented 10 years ago

Scott's provided some feedback. He thinks it would be better if we used -linscale and -logscale as the identifiers for our png outputs. And the filter names should be, like the rest of the filename, in lower case.
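Taken together, the revisions so far (a hyphen between the rootname and the target, lower-case filter names, and -linscale/-logscale for the pngs) amount to a mechanical transformation of the first-draft names. The sketch below is an illustration only: `revise_name` is a hypothetical helper, it assumes target names contain no underscores, and it assumes the png identifiers follow the product suffix as in the earlier img-linear examples.

```python
import re

# Pattern for the first-draft convention proposed earlier in this issue.
_OLD = re.compile(
    r"^hlsp_mt_hst_(?P<inst>[^_]+)_(?P<root>[^_]+)_(?P<target>[^_]+)_"
    r"(?P<filt>[^_]+)_(?P<ver>v\d+-\d+)_(?P<suffix>[^.]+)\.(?P<ext>\w+)$"
)
_SCALE = {"linear": "linscale", "log": "logscale"}


def revise_name(old_name):
    """Convert a first-draft name to the revised convention (hypothetical)."""
    match = _OLD.match(old_name)
    if match is None:
        raise ValueError("not a first-draft name: " + old_name)
    suffix = match.group("suffix")
    if "-" in suffix:
        # e.g. 'img-linear' becomes 'img-linscale'
        product, scale = suffix.split("-", 1)
        suffix = product + "-" + _SCALE.get(scale, scale)
    return "hlsp_mt_hst_{}_{}-{}_{}_{}_{}.{}".format(
        match.group("inst"), match.group("root"), match.group("target"),
        match.group("filt").lower(), match.group("ver"),
        suffix, match.group("ext"))


print(revise_name("hlsp_mt_hst_wfpc2_asdfghjkl_ceres_F606W_v1-0_img-log.png"))
```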

I've edited the AstroDrizzle step slightly. Instead of copying a file to rename it, we just use os.rename. This should save us a bit of time on the disk.

I'm working on renaming all of our existing files to the new conventions, which is definitely a harder problem than just changing the filename handling in the code.

To make my life simpler, I'm moving all files in /astro/mtpipeline/mtpipeline_outputs/*/*/ that have center, mask, or d2im in their name to /astro/mtpipeline/trash. I've preserved a list of their original filepaths, just in case, so they can theoretically be put back. These are all AstroDrizzle outputs that we no longer desire.
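A cleanup of this kind could look roughly like the sketch below. It is not the script that was actually run; `move_unwanted`, its arguments, and the manifest format are assumptions made for illustration. It records each original path before moving the file so the move can be undone later.

```python
import glob
import os
import shutil

UNWANTED = ("center", "mask", "d2im")


def move_unwanted(output_root, trash_dir, manifest_path):
    """Move unwanted AstroDrizzle products two directory levels below
    output_root into trash_dir, logging each original path."""
    os.makedirs(trash_dir, exist_ok=True)
    moved = []
    pattern = os.path.join(output_root, "*", "*", "*")
    with open(manifest_path, "w") as manifest:
        for path in sorted(glob.glob(pattern)):
            name = os.path.basename(path)
            if not os.path.isfile(path):
                continue
            if not any(tag in name for tag in UNWANTED):
                continue
            destination = os.path.join(trash_dir, name)
            if os.path.exists(destination):
                # Skip rather than overwrite a same-named file already in trash.
                continue
            manifest.write(path + "\n")
            shutil.move(path, destination)
            moved.append(path)
    return moved
```

Since the trash directory is flat, two files with the same basename from different folders would collide; this sketch skips the second one instead of overwriting.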

We're also going to abandon the practice of making c1m symlinks for AstroDrizzle. If you move the files around, the symlink becomes obsolete. Instead, we'll just create a duplicate c1m file entirely. This shouldn't be all too challenging.
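As a rough sketch of the copy-instead-of-symlink approach, assuming the duplicate c1m simply needs the same contents as the original under the new name: `duplicate_c1m` is a hypothetical helper, not a function in the pipeline.

```python
import os
import shutil


def duplicate_c1m(original_c1m, new_c1m):
    """Create a real copy of the c1m file under the new name."""
    if os.path.islink(new_c1m):
        # Remove a leftover symlink first; copying onto it would follow the
        # link instead of creating an independent file.
        os.remove(new_c1m)
    shutil.copyfile(original_c1m, new_c1m)
    return new_c1m
```

The cost is extra disk space for each duplicate, in exchange for files that stay valid when they are moved.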

ktfhale commented 10 years ago

Last night I ran a script that tried to rename everything in /astro/mtpipeline/mtpipeline_outputs to the new naming conventions.

The script also produces a text file full of mv commands that should reverse this... again, assuming nothing is inadvertently overwritten.

Unfortunately, it got stuck about 18% through. Looks like some of the wfpc2 pngs have read-only permissions.
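The rename-with-undo idea can be sketched as follows. This is not the script that was run; `rename_with_undo` and its return values are hypothetical. It refuses to overwrite existing files, writes one reversing mv command per successful rename, and records failures instead of stopping partway.

```python
import os
import shlex


def rename_with_undo(renames, undo_path):
    """Rename (old, new) pairs, appending reversing mv commands to undo_path.

    Returns (done, failed); failed entries are (old, new, reason).
    """
    done, failed = [], []
    with open(undo_path, "a") as undo:
        for old, new in renames:
            if os.path.exists(new):
                failed.append((old, new, "destination exists"))
                continue
            try:
                os.rename(old, new)
            except OSError as err:
                # e.g. permission problems; keep going with the other files
                failed.append((old, new, str(err)))
                continue
            undo.write("mv {} {}\n".format(shlex.quote(new), shlex.quote(old)))
            undo.flush()
            done.append((old, new))
    return done, failed
```

Collecting failures and continuing means one unwritable folder does not block the rest of the archive, and the failed list shows exactly what still needs to be renamed once permissions are fixed.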

I've changed the permissions of everything in /astro/mtpipeline/mtpipeline_outputs/wfc3 and acs. But I'll need Wally to run

chmod -R 775 /astro/mtpipeline/mtpipeline_outputs/wfpc2/ 

before I can rename all the files there.

ktfhale commented 10 years ago

I believe we have transitioned to the new filename conventions. A search of the archive for missing files under the new conventions reveals only files missing due to one of the 83 known bad files, or a missing WFPC2 logscale png output. I haven't run the pipeline with the log switch enabled on the wfpc2 folder, so I'm doing that now.
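A search like the one described could be sketched as below, assuming we already have, for each input file, the list of output paths expected under the new conventions. The function `find_missing` and the shape of its arguments are hypothetical.

```python
import os


def find_missing(expected_by_input, known_bad=()):
    """Map each input file to the expected outputs that do not exist.

    Inputs whose basename is in known_bad are skipped, since their
    outputs are expected to be absent.
    """
    missing = {}
    for input_path, outputs in expected_by_input.items():
        if os.path.basename(input_path) in known_bad:
            continue
        absent = [path for path in outputs if not os.path.exists(path)]
        if absent:
            missing[input_path] = absent
    return missing
```

An empty result would mean every expected product exists, apart from those belonging to the known bad files.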

We also must merge my mast-names branch into master, as attempting to run master on /astro/mtpipeline/mtpipeline_outputs will result in unneeded reprocessing.

ktfhale commented 10 years ago

Now that the branch is merged, I think this issue can be closed.