STScI-Citizen-Science / MTPipeline

Pipeline to produce CR-rejected, astrodrizzled PNGs of HST WFPC2 solar system data.

Run CR rejection on all images #128

Closed ktfhale closed 10 years ago

ktfhale commented 10 years ago

Once add_metadata is merged into master (Ticket #127), we can run CR rejection, though not AstroDrizzle or the PNG processing, on our entire archive. First, we need an estimate of how long this will take. I'm timing execution on my sample sets.

ktfhale commented 10 years ago

Assuming my 90-image sample set is representative, it should take roughly 40 hours of processor time to run CR rejection on all of our data. If we divide that across 4 cores on each of the 4 science machines, it should take only a few hours to complete.
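For reference, the back-of-the-envelope math behind the "few hours" claim (the 40-hour figure is the extrapolation from my sample set, so treat the numbers as rough):

```python
# Rough wall-clock estimate: serial CPU-hours divided across all the workers.
# The 40 CPU-hour figure is extrapolated from my 90-image sample set.
total_cpu_hours = 40.0
machines = 4
cores_per_machine = 4

wall_clock_hours = total_cpu_hours / (machines * cores_per_machine)
print('~{:.1f} hours if all {} cores stay busy'.format(
    wall_clock_hours, machines * cores_per_machine))  # ~2.5 hours
```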

I've manually set the number of cores to 4 in the add_metadata branch, so I can get the pipeline running as soon as possible on the science machines. This will be fixed in Ticket #130.

Two things are blocking setup.py at the moment: I need to figure out how to get Ureka onto the science machines, and I need to figure out where cfitsio lives on the science filesystem, since the science machines apparently can't see /sw/lib and /sw/include.

ktfhale commented 10 years ago

The way I'm currently getting the pipeline to run involves calling it on individual project folders in /astro/mtpipeline/archive/*/. This works, but it's not so good for the logging, as we'll have 966 individual logs when we're done.
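For the record, the driving loop is essentially this (a sketch of what my shell scripts do; the run_imaging_pipeline arguments shown here are approximations, not the actual CLI):

```python
import glob
import subprocess

# Sketch of the per-project-folder driver. The real work is done by shell
# scripts; the run_imaging_pipeline invocation below is an approximation.
project_folders = sorted(glob.glob('/astro/mtpipeline/archive/*/*/'))

for folder in project_folders:
    # One invocation, and therefore one logfile, per project folder;
    # hence the 966 separate logs.
    subprocess.call(['run_imaging_pipeline', '-filelist', folder + '*.fits'])
```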

After some finagling with Ureka and virtual environments, Science3 and 4 are currently running the pipeline on their respective lists of project folders. I still can't get it to run on Science1 and 2. Each call to the pipeline produces the error

python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory

Not sure why it can't find libpython2.7.so.1.0. I'm running a grep search on science4 for this library, since, as the pipeline works on science4, it presumably has it.

ktfhale commented 10 years ago

I've yet to hear back about this bug on Science1 and 2, but apparently Science2 is not a public machine, and we should not use it.

The pipeline's finished on the portions of the data I gave to Science3 and 4. I've started those machines working on the sets allocated to Science1 and 2. All of our data should be CR-rejected in about two and a half hours.

ktfhale commented 10 years ago

Unfortunately, it appears the pipeline is encountering critical errors on some files. When it does, it seems to fail on every file in that project. So far, 123 of the output logs (one for each project folder) contain the string CRITICAL.

It's somewhat surprising that so common an exception-causing feature evaded my sample set. It's probably something like a varying number of extensions: something I assumed to be constant across all FITS files for a particular detector that turns out not to be. Once the pipeline's finished doing what it can, I'll take a closer look and fix whatever's wrong.

ktfhale commented 10 years ago

A question: are we sure the logfiles are unique for each time the pipeline's run from the terminal, and there's no way one instance of the pipeline can end up writing to the same logfile as another? I'm finding logfiles that list files from multiple detectors, which shouldn't be possible if it's a one project folder -> one logfile mapping.

EDIT: I set both instances of the pipeline, on the science3 and the science4 machine, to log in the same directory. They're probably writing to each other's logfiles, which is unfortunate.
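If it turns out the names really can collide, one fix would be to fold the hostname and process ID into the logfile name, something like this (a sketch; I haven't checked how the pipeline currently builds its logfile names):

```python
import datetime
import os
import socket

def unique_logfile_name(log_dir, prefix='mtpipeline'):
    """Build a logfile name that can't collide across machines or processes."""
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    hostname = socket.gethostname()  # distinguishes science3 from science4
    pid = os.getpid()                # distinguishes concurrent runs on one host
    return os.path.join(log_dir, '{}_{}_{}_{}.log'.format(
        prefix, timestamp, hostname, pid))
```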

ktfhale commented 10 years ago

The pipeline's finished, and there seems to be a lot missing. We have contradictory information on how much is missing.

A script I wrote that looks for missing files based on the input filenames says ~13,000 are missing. But across all of our logfiles there have only been 9,155 CRITICAL errors. And finally, Gmail reports I have either 954 or 960 'Process Completed' emails, whereas I tallied 966 project folders.
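The check my script does boils down to this (a sketch for the WFPC2 inputs; I'm assuming the CR-rejected output inserts '_cr' into the input name, e.g. u65z1302m_c0m.fits becomes u65z1302m_cr_c0m.fits, and ACS/WFC3 inputs would need their own suffixes):

```python
import glob
import os

# Sketch of the missing-output check for WFPC2 inputs. The output naming
# convention here is an approximation of whatever the pipeline actually uses.
missing = []
for input_file in glob.glob('/astro/mtpipeline/archive/wfpc2/*/*_c0m.fits'):
    if '_cr_' in os.path.basename(input_file):
        continue  # skip files that are themselves CR-rejected outputs
    expected = input_file.replace('_c0m.fits', '_cr_c0m.fits')
    if not os.path.exists(expected):
        missing.append(input_file)

print('{} WFPC2 inputs lack a CR-rejected counterpart'.format(len(missing)))
```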

Tomorrow, I'll do several things. I'll try to resolve the discrepancies between these numbers. I'll try running some of the failed files locally. Hopefully, it's a problem with running the pipeline on Linux, and not something weird about the files themselves that I somehow managed to miss 13,000 instances of.

ktfhale commented 10 years ago

I've discovered the cause of at least one error. If the image is taken by SBC or IR, run_cosmicx just runs shutil.copyfile() to produce an identical copy with _cr_ in the filename. This is failing frequently, because I don't have write permissions in many of the project folders. I do have write permission in some of them, but not all. I'll look into fixing this.
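For context, my reading of that branch is roughly the following (paraphrased, not the actual run_cosmicx source; the output naming is a placeholder):

```python
import shutil

def cr_reject_or_copy(input_file, detector):
    """Paraphrase of the SBC/IR special case: no cosmic-ray rejection is run
    for those detectors, so the 'CR-rejected' output is just a copy of the
    input. The output naming below is a placeholder, not the real convention."""
    output_file = input_file.replace('.fits', '_cr.fits')
    if detector in ('SBC', 'IR'):
        # copyfile needs write permission on the destination directory;
        # this is the call that fails in the read-only project folders.
        shutil.copyfile(input_file, output_file)
    else:
        pass  # the real cosmicx CR rejection runs here
    return output_file
```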

EDIT: I don't think I have write permission in any of the ACS project folders. I'm not quite adept at reading these yet, but every project folder in /astro/mtpipeline/archive/acs has these permissions: drwxr-xr-x. I'm fairly sure I need a 'w' in that middle (group) triplet to be able to write the _cr_ outputs into those folders.

So I'm part of the STSCI/science group, while Wally is STSCI/mtpipe. How would I go about getting that changed?

ktfhale commented 10 years ago

So, of the 13,657 missing files, 9,976 are from ACS (the entire ACS archive), 2,132 are from WFPC2, and 1,549 are from WFC3.

Hopefully, the pipeline works perfectly on all ACS data and we just have to change the permissions. I'll start looking for the WFPC2 and WFC3 problems.
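The breakdown comes from grouping the missing paths by the instrument directory under /astro/mtpipeline/archive/, roughly like this (a small sketch operating on a list of missing file paths):

```python
import collections
import os

def count_by_instrument(missing_paths):
    """Tally missing files by the directory after .../archive/ (acs, wfpc2, wfc3)."""
    counts = collections.Counter()
    for path in missing_paths:
        parts = path.split(os.sep)
        counts[parts[parts.index('archive') + 1]] += 1
    return counts
```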

ktfhale commented 10 years ago

I'm looking into the missing WFPC2 files now. Strangely, running the pipeline locally and on science4, I've been able to create the _cr_ files for one of the project directories whose outputs were missing after I first ran my shell scripts (which call run_imaging_pipeline on each project folder). I've checked and made sure that the command to run the pipeline on that folder exists in one of my shell scripts. But none of the logs from any of yesterday's executions of the pipeline mention it.

I don't think the pipeline could have crashed before it wrote the filelist argument to the log. Since this folder doesn't appear in any of the logs, I'm wondering whether the shell script I was running somehow exited before it finished. I figure I would have noticed that. Besides, if a single command in a shell script fails, execution just moves on to the next one... so I'm mystified.
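To narrow this down, I'll cross-check the project folders named in the shell scripts against the folders mentioned anywhere in yesterday's logs, something like this (a rough diagnostic sketch; the script and log paths below are placeholders):

```python
import glob

def folders_missing_from_logs(script_paths, log_paths, all_folders):
    """Return project folders that the shell scripts reference but that never
    appear in any logfile. Only assumes each log mentions the folder's path."""
    script_text = ''.join(open(p).read() for p in script_paths)
    log_text = ''.join(open(p).read() for p in log_paths)
    scripted = [f for f in all_folders if f.rstrip('/') in script_text]
    return [f for f in scripted if f.rstrip('/') not in log_text]

folders = sorted(glob.glob('/astro/mtpipeline/archive/wfpc2/*/'))
print(folders_missing_from_logs(glob.glob('run_wfpc2_*.sh'),  # placeholder name
                                glob.glob('logs/*.log'),      # placeholder name
                                folders))
```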

ktfhale commented 10 years ago

I've rerun CR rejection, excluding ACS, which I still don't have write permission for. There were only three CRITICAL exceptions logged, for three files from three different projects. I previously ran across these files when I was gathering up header information; astropy warns that they may have been truncated:

/astro/mtpipeline/archive/wfpc2/08699_comets/u65z1302m_c0m.fits
/astro/mtpipeline/archive/wfpc2/08876_comet_linear/u6aj1102r_c0m.fits
/astro/mtpipeline/archive/wfpc2/10860_kbo_quaoar/u9qs010ym_c0m.fits

I expect they've been damaged. I'll try finding and redownloading them from MAST.
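For reference, that truncation warning can be turned into an explicit check, roughly like this (a sketch that just captures astropy's warning rather than inspecting the file structure itself):

```python
import warnings
from astropy.io import fits

def looks_truncated(path):
    """Return True if astropy warns the file may be truncated when its data
    are read. A sketch: it only looks for the warning text."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        with fits.open(path) as hdulist:
            for hdu in hdulist:
                _ = hdu.data  # force the data to be read
    return any('truncated' in str(w.message).lower() for w in caught)
```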

ktfhale commented 10 years ago

I replaced these three files with versions downloaded from MAST; the new copies had different hashes from the ones in our archive. I reran the pipeline on all the remaining missing files, including these, except for the ACS projects I can't write to. CR rejection succeeded for every input file.

We have CR-rejected outputs for all of our WFPC2 and WFC3 data. It looks like if a worker hits a CRITICAL exception on one file, it may not do any of its remaining jobs. This is the only explanation I can think of for why, when the pipeline encountered only three CRITICAL exceptions from the three files above, it failed to perform CR rejection on hundreds of input files.
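If that's really what's happening, one defensive option would be to catch exceptions per file inside the worker, so one bad input can't take the rest of its batch down with it. A generic sketch (I haven't checked how the pipeline actually dispatches jobs, so the multiprocessing setup and function names here are stand-ins):

```python
import logging
import multiprocessing

def cr_reject_one_file(input_file):
    """Stand-in for the pipeline's real per-file CR rejection step."""
    raise NotImplementedError

def safe_cr_reject(input_file):
    """Wrap the per-file work so one failure is logged, not fatal to the batch."""
    try:
        cr_reject_one_file(input_file)
        return (input_file, True)
    except Exception:
        logging.critical('CR rejection failed for %s', input_file, exc_info=True)
        return (input_file, False)

def run_batch(input_files, cores=16):
    pool = multiprocessing.Pool(cores)
    try:
        results = pool.map(safe_cr_reject, input_files)
    finally:
        pool.close()
        pool.join()
    return [f for f, ok in results if not ok]  # files that still need attention
```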

Regardless, I now have write access to the ACS projects. I've started the pipeline on all our ACS data, running with 16 cores on science4. I only needed a single call to run_imaging_pipeline, so the logging should all go into one nice file.

I started the pipeline 10 minutes ago. So far it hasn't hit any critical errors, and it's about a third of the way through the ACS files. I didn't randomize the input order, so we can't exactly assume it'll keep going at its present pace, but I'm hoping it'll be done before I head out.

ktfhale commented 10 years ago

The pipeline appears to have run successfully over all our ACS data, without a single exception. I don't think there are any input files without a _cr_ counterpart. I believe this ticket can be closed.