Closed: heather999 closed this issue 2 years ago.
Just a note: since these were found by investigating failed patches in DR2 processing, these warp files are for Y1 visits that have propId==54 (except for some of the u-band/tract 4850 files). It is therefore likely that many more warp files at CC-IN2P3 are similarly flawed.
I searched for the first case in the list. The only log file I found is
/sps/lsst/users/descprod/Pipeline2/Logs/DC2DM_DRP/2.9/task_coadd/task_coadd_tract_patch/task_coaddDriver/run_coaddDriver/022/021/008/002/logFile.txt
and I could not find anything wrong in it. However, the log's end date predates the file's timestamp, and the last lines of the log indicate that the warps had already been deleted by then, so this must have been relaunched for some reason even though the initial run was OK. I could not find any other log for a manual relaunch, sorry. Note that fitsinfo does not return a warning that the file is truncated.
Note also that for this first case (4023), all the warp directories on disk have a more recent timestamp than the end products in deepCoadd-results/g/4023/. I can't track down where these warps came from, but the end products used downstream predate them. Around May 10-15 is when the disk space filled up completely, I think, after a series of issues with the Isilon filesystem.
OK, so I do not know if we can manage to track this down. What I can say is that the warps in the list above from Y1 have timestamps ranging from May 7 through May 26.
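Since the suspect files cluster in a known timestamp window, one could flag candidates by modification time before doing any expensive FITS checks. A minimal sketch; the year and the window bounds are assumptions here, and the real check would run over the repo's warp paths:

```python
import datetime
import os

# Assumed window (year is a guess; adjust to the actual processing dates).
WINDOW_START = datetime.datetime(2021, 5, 7)
WINDOW_END = datetime.datetime(2021, 5, 27)  # exclusive, so it covers through May 26


def in_suspect_window(path):
    """Return True if the file's mtime falls in [WINDOW_START, WINDOW_END)."""
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    return WINDOW_START <= mtime < WINDOW_END
```

This would only narrow the candidate list; files in the window still need the FITS-level verification to confirm actual corruption.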
Searching through the logs for errors: when I do find an example warp with a cfitsio error, the file has been removed and no longer exists, and the timestamp of the log corresponds to mid-May, before any cleanup and relaunch.
We should at least check the warp files (and probably other DR6 files as well) at CC-IN2P3 to see which ones have problems and should be moved out of the way. To identify the corrupted files in the DR2 repo, I ran:
```python
import os
from astropy.io import fits

for warp_file in warp_files:
    with fits.open(warp_file) as warp:
        try:
            # Check the overall FITS structure first...
            warp.verify(option='exception')
            # ...then force a read of HDUs 1-3 (image, mask, variance);
            # accessing .data decompresses the tiles and surfaces cfitsio errors.
            for i in range(1, 4):
                warp[i].data.shape
        except Exception as eobj:
            print(os.path.basename(warp_file), eobj)
            output.write(warp_file + '\n')
```
where `warp_files` is a glob of the warp files in the deepCoadd folder. There may be better ways to check these files, but we should try to find all of the problematic ones regardless.
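For completeness, `warp_files` could be assembled along these lines. This is a sketch: the filename pattern and directory layout are illustrative guesses, not the actual repo conventions:

```python
import glob
import os


def find_warp_files(repo_root, pattern='*warp*.fits'):
    """Recursively collect candidate warp FITS files under repo_root.

    The default pattern is an assumption; adjust it to match the actual
    warp filenames in the deepCoadd directory tree.
    """
    return sorted(glob.glob(os.path.join(repo_root, '**', pattern),
                            recursive=True))
```

Sorting the result makes reruns deterministic, which helps when comparing output lists between check jobs.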
Sure, we can run that check, make a note of the files, and remove them at CC. But what are our intentions for these warp files? We purposely avoided copying them over to NERSC and will not be storing them with the DR6-WFD archive on tape. Are there plans to back these up at CC? As it stands, the set of warps available is incomplete. Do we want to retain any of them beyond the set used for DR2?
Depending on the Run3.1i plans for template production, some of the DR6 warps in the DDF region may be useful. Otherwise, we can certainly delete them if there are no other plans to use them. The main reason for finding which warps have issues right now is to gauge the extent of the data corruption. If it seems pervasive (I suspect it might be), the case for checking the other DR6 data is more compelling.
I had no intention of reusing the warps for 3.1i.
Haven't forgotten about this. I have some jobs running over the existing warps to check them, using code based on what Jim posted above. I got a little stymied by the CC outage over the weekend; I restarted some jobs and hopefully they'll finish up.
Jobs are still running, but I started to take a look at the output. Bad warp FITS files have been identified across all bands, across all years Y1-Y5, and in a number of different tracts. The common factor is likely the timestamp of around May 8-15, which would seem to correspond to the disk filling up at CC.
Here are some lists by band (`wc` counts of lines, words, and bytes):

```
/sps/lsst/users/hkelly/dr6-warps-checks(0)>wc *.txt
  78  520 12567 g-bad-warps.txt
  36  252  6401 i-bad-warps.txt
  21  154  3759 r-bad-warps.txt
  87  631 15182 u-bad-warps.txt
  23  174  4139 y-bad-warps.txt
   7   40  1194 z-bad-warps.txt
 252 1771 43242 total
```
Those files also include the error messages associated with each bad warp. Most of the errors are one of:

- buffer is too small for requested array
- uncompressed tile has wrong size
- Header missing END card
- Verification reported errors: HDUList's element 2 is not an extension HDU.
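Since the lists pair each filename with its error message, the failure modes can be tallied quickly. A sketch, assuming each line has the form `<filename> <error message>` as printed by the check loop above:

```python
from collections import Counter


def tally_errors(lines):
    """Count occurrences of each distinct error message.

    Assumes each input line is '<filename> <error message>'; lines that
    don't match (e.g. blank lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.strip().split(None, 1)  # split filename from message
        if len(parts) == 2:
            counts[parts[1]] += 1
    return counts
```

A per-message breakdown like this would show at a glance whether one failure mode (e.g. truncated compressed tiles) dominates, which bears on whether a single event like the disk filling up explains all of the corruption.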
As discussed today at the DESC DM/DA meeting, the DR2 processing at NERSC uncovered a small number (~60) of corrupt warp files. These warp files were copied from CC's Run2.2i DR6 repo (
/sps/lssttest/dataproducts/desc/DC2/Run2.2i/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1*
), with checksum checking on. Spot checks of some of these files confirm that the files at CC suffer from the same corruption and are truly identical to what we have at NERSC. It would be nice to understand further what happened with these files and to confirm that the problems with these warps did not impact the DR6 WFD processing. The Run2.2i DR6 repo includes the forced_src* files for the specific warps that were identified as corrupt, and those have timestamps well after those of the warp files, so it would seem the processing proceeded without problems.
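The CC-vs-NERSC identity spot checks can be reproduced with an independent digest computed on both sides. A minimal sketch (not the transfer tool's own checksum mechanism):

```python
import hashlib


def file_digest(path, algo='sha256', chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large warp files never sit fully in memory."""
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()
```

Matching digests at CC and NERSC would confirm the corruption predates the copy, i.e. the transfer faithfully propagated already-bad bytes.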
It would help to confirm that the SRS workflow logs for DR6 are here:
/sps/lsst/users/descprod/Pipeline2/Logs/DC2DM_DRP/2.9/task_coadd/task_coadd_tract_patch/task_coaddDriver/
We may be able to use those to confirm there were no processing errors logged. Recall there was some reprocessing and manual cleanup during the DR6 processing, so I'm a little unclear whether the SRS logs will tell us the full story or whether they are all located in the above directory.
Here is a list of the identified corrupt warps: