LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Fix Corrupt warp files #400

Closed heather999 closed 2 years ago

heather999 commented 3 years ago

As discussed today at the DESC DM/DA meeting, the DR2 processing at NERSC uncovered a small number (~60) of corrupt warp files. These warp files were copied from CC's Run2.2i DR6 repo (/sps/lssttest/dataproducts/desc/DC2/Run2.2i/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1*), with checksum checking on. Spot checks of some of these files confirm that the files at CC suffer from the same corruption and are truly identical to what we have at NERSC. It would be nice to understand further what happened with these files and to confirm that the problems with these warps did not impact the DR6 WFD processing.
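For spot checks like the ones described above, one way to confirm that two copies of a warp are byte-identical is to compare digests. This is only a minimal sketch: `file_sha256` and `same_content` are hypothetical helper names, and SHA-256 is an assumption here, not necessarily the checksum used during the actual transfer.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        # Read in 1 MiB chunks so large warp files are not loaded at once.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if the two files have identical bytes (compared via SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Matching digests only show that the copy was faithful; they say nothing about whether the original file was already corrupt when it was written.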

The Run2.2i DR6 repo includes the forced_src* files for the specific warps that were identified as corrupt and they have timestamps well after the timestamps of the warp files. It would seem the processing proceeded without problems.

It would help to confirm that the SRS workflow logs for DR6 are under /sps/lsst/users/descprod/Pipeline2/Logs/DC2DM_DRP/2.9/task_coadd/task_coadd_tract_patch/task_coaddDriver/ because we may be able to use them to confirm that no processing errors were logged.
Recall that there was some reprocessing and manual cleanup during the DR6 processing, so I'm unsure whether the SRS logs will tell us the full story, or whether they are all located in the above directory.

Here is a list of the identified corrupt warps:

deepCoadd/g/4023/2,0/warp-g-4023-2,0-183813.fits
deepCoadd/u/4023/5,4/warp-u-4023-5,4-200812.fits
deepCoadd/r/3829/6,2/warp-r-3829-6,2-193233.fits
deepCoadd/r/4023/1,3/warp-r-4023-1,3-202627.fits
deepCoadd/r/3830/2,5/warp-r-3830-2,5-236833.fits
deepCoadd/r/3830/2,5/warp-r-3830-2,5-212118.fits
deepCoadd/r/3829/1,6/warp-r-3829-1,6-193147.fits
deepCoadd/r/3829/1,3/warp-r-3829-1,3-236833.fits
deepCoadd/r/4644/5,3/warp-r-4644-5,3-193164.fits
deepCoadd/r/3829/3,2/warp-r-3829-3,2-40327.fits
deepCoadd/r/4229/1,0/warp-r-4229-1,0-213560.fits
deepCoadd/r/5067/0,1/warp-r-5067-0,1-181940.fits
deepCoadd/i/4853/6,1/warp-i-4853-6,1-206346.fits
deepCoadd/i/4853/6,1/warp-i-4853-6,1-211396.fits
deepCoadd/i/4023/6,3/warp-i-4023-6,3-214466.fits
deepCoadd/u/4853/4,6/warp-u-4853-4,6-235784.fits
deepCoadd/u/4023/4,4/warp-u-4023-4,4-218325.fits
deepCoadd/i/4027/5,0/warp-i-4027-5,0-227976.fits
deepCoadd/i/4027/2,5/warp-i-4027-2,5-227882.fits
deepCoadd/u/4641/3,6/warp-u-4641-3,6-219141.fits
deepCoadd/i/3443/3,3/warp-i-3443-3,3-192358.fits
deepCoadd/i/3450/3,0/warp-i-3450-3,0-256382.fits
deepCoadd/u/4434/0,4/warp-u-4434-0,4-179998.fits
deepCoadd/u/4434/0,4/warp-u-4434-0,4-200752.fits
deepCoadd/z/4023/3,4/warp-z-4023-3,4-187500.fits
deepCoadd/z/4646/5,1/warp-z-4646-5,1-210477.fits
deepCoadd/z/4023/0,2/warp-z-4023-0,2-209816.fits
deepCoadd/z/4023/0,5/warp-z-4023-0,5-209030.fits
deepCoadd/z/4854/0,2/warp-z-4854-0,2-191379.fits
deepCoadd/u/4432/3,0/warp-u-4432-3,0-179970.fits
deepCoadd/z/4023/1,4/warp-z-4023-1,4-243018.fits
deepCoadd/z/4032/3,1/warp-z-4032-3,1-13279.fits
deepCoadd/u/4227/2,4/warp-u-4227-2,4-179970.fits
deepCoadd/y/4849/0,3/warp-y-4849-0,3-242287.fits
deepCoadd/y/4027/1,3/warp-y-4027-1,3-169810.fits
deepCoadd/y/4023/2,4/warp-y-4023-2,4-191169.fits
deepCoadd/y/4023/2,4/warp-y-4023-2,4-206052.fits
deepCoadd/y/4023/0,6/warp-y-4023-0,6-190489.fits
deepCoadd/y/4640/3,4/warp-y-4640-3,4-52543.fits
deepCoadd/y/4644/2,5/warp-y-4644-2,5-169830.fits
deepCoadd/y/4644/2,0/warp-y-4644-2,0-190332.fits
deepCoadd/y/4850/4,3/warp-y-4850-4,3-167899.fits
deepCoadd/y/4850/6,5/warp-y-4850-6,5-190264.fits
deepCoadd/y/4850/6,5/warp-y-4850-6,5-191179.fits
deepCoadd/u/4640/4,5/warp-u-4640-4,5-219916.fits
deepCoadd/u/3830/5,2/warp-u-3830-5,2-2336.fits
deepCoadd/u/4023/1,0/warp-u-4023-1,0-200936.fits
deepCoadd/u/3833/4,6/warp-u-3833-4,6-235893.fits
deepCoadd/u/4850/6,5/warp-u-4850-6,5-238629.fits
deepCoadd/u/4850/6,5/warp-u-4850-6,5-256304.fits
deepCoadd/u/4850/6,5/warp-u-4850-6,5-4565.fits
deepCoadd/u/4850/6,5/warp-u-4850-6,5-236478.fits
deepCoadd/u/4850/6,5/warp-u-4850-6,5-238642.fits
deepCoadd/u/3833/6,2/warp-u-3833-6,2-235893.fits
deepCoadd/u/3833/2,6/warp-u-3833-2,6-2333.fits
deepCoadd/g/4028/4,5/warp-g-4028-4,5-159478.fits
deepCoadd/g/4645/3,0/warp-g-4645-3,0-219183.fits
deepCoadd/u/4023/0,3/warp-u-4023-0,3-218325.fits
deepCoadd/g/4432/5,5/warp-g-4432-5,5-193822.fits
deepCoadd/g/4641/0,4/warp-g-4641-0,4-183892.fits
deepCoadd/g/4641/0,0/warp-g-4641-0,0-159521.fits

jchiang87 commented 3 years ago

Just a note: Since these were found from investigating failed patches in DR2 processing, these warp files are for Y1 visits that have propId==54 (except for some of the u-band/tract 4850 files). Therefore, it's likely there are many more warp files at CC-IN2P3 that are similarly flawed.
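To see how the listed files cluster by band and tract, the paths can be split into their components. This is a rough sketch that assumes every path follows the `deepCoadd/<band>/<tract>/<patch>/warp-<band>-<tract>-<patch>-<visit>.fits` layout seen in the list above; `parse_warp_path` is a hypothetical helper, and only three of the listed paths are used as sample input.

```python
from collections import Counter

def parse_warp_path(path):
    """Split a warp path into (band, tract, patch, visit)."""
    band, tract, patch, filename = path.strip().split("/")[-4:]
    # The visit number is the last dash-separated field before '.fits'.
    visit = int(filename[:-len(".fits")].rsplit("-", 1)[1])
    return band, int(tract), patch, visit

paths = [
    "deepCoadd/g/4023/2,0/warp-g-4023-2,0-183813.fits",
    "deepCoadd/u/4850/6,5/warp-u-4850-6,5-238629.fits",
    "deepCoadd/u/4850/6,5/warp-u-4850-6,5-256304.fits",
]

# Count bad warps per (band, tract) pair.
counts = Counter(parse_warp_path(p)[:2] for p in paths)
for (band, tract), n in sorted(counts.items()):
    print(band, tract, n)
```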

johannct commented 3 years ago

I searched for the first case in the list. The only log file I found is /sps/lsst/users/descprod/Pipeline2/Logs/DC2DM_DRP/2.9/task_coadd/task_coadd_tract_patch/task_coaddDriver/run_coaddDriver/022/021/008/002/logFile.txt, and I could not find anything wrong there. However, the end date in the log predates the timestamp of the warp file, and the last lines of the log indicate that the warps had been deleted by then. So this must have been relaunched for some reason, even though the initial run was OK. I could not find any other log for a manual relaunch, sorry. Note that fitsinfo does not return a warning that the file is truncated.
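Since a header-only listing may not notice truncation, a cheap complementary check is the file size: the FITS standard lays files out in 2880-byte blocks, so a size that is not a multiple of 2880 points to an incomplete write. This is a sketch of a different technique from the one used in this thread, and `has_partial_block` is a hypothetical name. A file that passes can still be corrupt.

```python
import os

FITS_BLOCK = 2880  # FITS files are organized in 2880-byte blocks

def has_partial_block(path):
    """True if the file size is not a whole number of FITS blocks."""
    return os.path.getsize(path) % FITS_BLOCK != 0
```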

johannct commented 3 years ago

Note also that for this first case (tract 4023), all the warp directories on disk have a more recent timestamp than the end-product .fits files under deepCoadd-results/g/4023/. I can't seem to track down where these warps came from, but the end products used down the line predate them. I think May 10-15 is around when the disk space filled up completely, after a series of issues with the Isilon filesystem.

heather999 commented 3 years ago

OK, so I do not know if we can manage to track this down. What I can say is that the Y1 warps listed above have timestamps ranging from May 7 through May 26.
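A timestamp range like this can be collected from the file modification times. The following is a minimal sketch, with `mtime_range` as a hypothetical helper; it assumes the modification times on disk still reflect when the warps were written and have not been reset by a later copy.

```python
import os
from datetime import datetime

def mtime_range(paths):
    """Return (earliest, latest) modification times, or None if no file exists."""
    mtimes = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    if not mtimes:
        return None
    return datetime.fromtimestamp(min(mtimes)), datetime.fromtimestamp(max(mtimes))
```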

When I search the logs for errors and find a warp with a cfitsio error, that file has since been removed and no longer exists, and the timestamp of the log corresponds to mid-May, before any cleanup and relaunch.

jchiang87 commented 3 years ago

We should at least check the warp files (and probably other DR6 files as well) that are at CC-IN2P3 to see which ones have problems and should be moved out of the way. To identify the corrupted files in the DR2 repo, I ran

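    import glob
    import os
    from astropy.io import fits

    # Assumed glob pattern and output name; adjust to the repo layout.
    warp_files = sorted(glob.glob('deepCoadd/*/*/*/warp-*.fits'))
    with open('bad-warps.txt', 'w') as output:
        for warp_file in warp_files:
            try:
                # Open inside the try so unreadable files are caught too.
                with fits.open(warp_file) as warp:
                    warp.verify(option='exception')
                    # Touch each image HDU so its pixel data is actually read.
                    for i in range(1, 4):
                        warp[i].data.shape
            except Exception as eobj:
                print(os.path.basename(warp_file), eobj)
                output.write(warp_file + '\n')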

where warp_files is a glob of the warp files in the deepCoadd folder. There may be better ways to check these files, but we should try to find all of the problematic ones regardless.

heather999 commented 3 years ago

Sure, we can run that check, make a note of the bad files, and remove them at CC, but what are our intentions for these warp files? We purposely avoided copying them over to NERSC and will not be storing them with the DR6-WFD archive on tape. Are there plans to back them up at CC? As it stands, the set of warps available is incomplete. Do we want to retain any of them beyond the set used for DR2?

jchiang87 commented 3 years ago

Depending on the Run3.1i plans for template production, some of the DR6 warps in the DDF region may be useful. If there are no plans to use them, we can certainly delete them. The main reason for finding which warps have issues right now would be to see the extent of the data corruption. If it seems pervasive (I suspect it might be), then the case for checking the other DR6 data is more compelling.

johannct commented 3 years ago

I had no intention of reusing the warps for 3.1i.

heather999 commented 3 years ago

Haven't forgotten about this. I have some jobs running over the existing warps to check them, using code based on what Jim posted above. I got a little stymied by the CC outage over the weekend, but I restarted some jobs and hopefully they'll finish up.

heather999 commented 3 years ago

Jobs are still going, but I started to take a look at the output. Bad warp FITS files show up across all bands, all years Y1-Y5, and a number of different tracts. The likely common factor is a date stamp of around May 8-15, which would seem to correspond to the disk filling up at CC.
Here are some lists by band:

/sps/lsst/users/hkelly/dr6-warps-checks(0)>wc *.txt
   78   520 12567 g-bad-warps.txt
   36   252  6401 i-bad-warps.txt
   21   154  3759 r-bad-warps.txt
   87   631 15182 u-bad-warps.txt
   23   174  4139 y-bad-warps.txt
    7    40  1194 z-bad-warps.txt
  252  1771 43242 total

Those lists also include the error message associated with each bad file. Most of the errors are one of:

buffer is too small for requested array

uncompressed tile has wrong size

Header missing END card

Verification reported errors:
HDUList's element 2 is not an extension HDU.
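To get a rough breakdown by error type, the recorded messages could be tallied against the four messages quoted above. This is a sketch only: the exact line format of the `*-bad-warps.txt` files is not shown in this thread, so it assumes each record contains the message text somewhere on the line, and `classify` and `tally` are hypothetical helpers.

```python
from collections import Counter

# Substrings taken from the error messages quoted above.
KNOWN_ERRORS = (
    "buffer is too small for requested array",
    "uncompressed tile has wrong size",
    "Header missing END card",
    "is not an extension HDU",
)

def classify(line):
    """Return the first known error substring found in line, else 'other'."""
    for message in KNOWN_ERRORS:
        if message in line:
            return message
    return "other"

def tally(lines):
    """Count how many lines fall into each error category."""
    return Counter(classify(line) for line in lines)
```

For example, `tally(open("g-bad-warps.txt"))` would count the categories in the g-band list, with anything unrecognized grouped under `other`.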