LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Move aside unneeded sensor-visits in Year 4 and Year 5 datasets #390

Closed jchiang87 closed 4 years ago

jchiang87 commented 4 years ago

Here's a comparison of the CCDs that were expected to be simulated for each Run2.2i visit versus the sensor-visits that were actually simulated, as identified by the raw files at CC-IN2P3 in /sps/lssttest/datasets/desc/DC2/Run2.2i/sim/y[1-5]-wfd. For each visit, the list of expected CCDs was computed using the trim_sensors.py code in https://github.com/LSSTDESC/desc_sim_utils/tree/u/jchiang/refactor_obs_md_handling, which allows one to use the opsim db file to get the pointing information.

[Figure: Run2.2i_expected_vs_simulated_sensor-visits]

Each red point is the number of CCDs that are missing from the expected list for a given visit, and each blue point is the expected number of CCDs minus the number that were simulated.

For y4 and y5, many more CCD raw files were generated than were expected, hence all the negative blue points. Those extra sensor-visits must lie outside of the Run2.2i simulation region, so we should move them aside before doing the DRP processing. I'll post a list of raw files to move aside in this issue.
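The per-visit comparison described above amounts to set arithmetic on CCD lists. The following is a minimal sketch, not the trim_sensors.py code itself: it assumes the expected and simulated CCDs are already available as dictionaries mapping a visit number to a set of CCD names, and the function names and the sample visits are hypothetical.

```python
def compare_sensor_visits(expected, simulated):
    """Return {visit: (n_missing, n_expected_minus_n_simulated)}.

    expected, simulated: dicts mapping visit -> set of CCD names
    (an assumed in-memory layout, for illustration only).
    """
    summary = {}
    for visit in sorted(set(expected) | set(simulated)):
        exp = expected.get(visit, set())
        sim = simulated.get(visit, set())
        n_missing = len(exp - sim)       # the "red point" for this visit
        n_diff = len(exp) - len(sim)     # the "blue point"; negative if extras exist
        summary[visit] = (n_missing, n_diff)
    return summary


def split_needed_unneeded(expected, simulated):
    """Split simulated sensor-visits into needed and unneeded lists."""
    needed, unneeded = [], []
    for visit in sorted(simulated):
        exp = expected.get(visit, set())
        for ccd in sorted(simulated[visit]):
            target = needed if ccd in exp else unneeded
            target.append((visit, ccd))
    return needed, unneeded


# Hypothetical inputs: visit 1 is missing one CCD, visit 2 has two extras.
expected = {1: {"R22_S11", "R22_S12"}, 2: {"R01_S00"}}
simulated = {1: {"R22_S11"}, 2: {"R01_S00", "R01_S01", "R01_S02"}}
print(compare_sensor_visits(expected, simulated))
print(split_needed_unneeded(expected, simulated)[1])
```

A visit with extra raw files shows up with a negative second value, which corresponds to the negative blue points in the plot.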

For the record, here are the statistics of missing vs expected sensor-visits for each year:

year   # missing   # expected   fraction missing
y1          3393       647516           5.24e-03
y2          3237       554479           5.84e-03
y3            11       754387           1.46e-05
y4           303       743906           4.07e-04
y5             2       860173           2.33e-06

jchiang87 commented 4 years ago

The lists of unneeded y4 and y5 files are rather long, so I won't post them here. They are available at the CC-IN2P3 machines:

(lsst-scipipe-1172c30) [in2p3] pwd -P
/pbs/home/j/jchiang/dev/desc_sim_utils/work
(lsst-scipipe-1172c30) [in2p3] wc unneeded_y*.txt
    9203     9203   819067 unneeded_y4_0000_0643.txt
    8602     8602   765578 unneeded_y4_0643_1286.txt
    9208     9208   819512 unneeded_y4_1286_1929.txt
   10224    10224   909936 unneeded_y4_1929_2572.txt
    9907     9907   881723 unneeded_y4_2572_3215.txt
   10413    10413   926757 unneeded_y4_3215_3858.txt
   13198    13198  1174622 unneeded_y4_3858_4501.txt
   10284    10284   915276 unneeded_y4_4501_5144.txt
   11534    11534  1026526 unneeded_y4_5144_5787.txt
   14575    14575  1297175 unneeded_y4_5787_6430.txt
     128      128    11392 unneeded_y4_6430_6435.txt
   59485    59485  5300179 unneeded_y5_0000_0744.txt
   37939    37939  3414510 unneeded_y5_0744_1488.txt
   38028    38028  3422520 unneeded_y5_1488_2232.txt
   45472    45472  4092480 unneeded_y5_2232_2976.txt
   51118    51118  4600620 unneeded_y5_2976_3720.txt
   44772    44772  4029480 unneeded_y5_3720_4464.txt
   46109    46109  4149810 unneeded_y5_4464_5208.txt
   55582    55582  5002380 unneeded_y5_5208_5952.txt
   50983    50983  4588470 unneeded_y5_5952_6696.txt
   56298    56298  5066820 unneeded_y5_6696_7439.txt
  593062   593062 53214833 total

Each of the unneeded_y*.txt files contains a list of file paths to the unneeded raw files.
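For anyone who wants to act on these lists, here is a minimal sketch of moving the listed files aside. It assumes one file path per line, as described above, and it assumes that keeping only the immediate parent directory name under the destination is enough to avoid name collisions; the actual directory layout is not stated here, and the function name and destination are hypothetical. It defaults to a dry run so the plan can be inspected first.

```python
import shutil
from pathlib import Path


def move_aside(list_file, dest_root, dry_run=True):
    """Move every path listed in list_file under dest_root.

    Returns a list of (source, destination) pairs. With dry_run=True
    nothing is moved and the planned pairs are returned instead.
    """
    dest_root = Path(dest_root)
    pairs = []
    with open(list_file) as handle:
        for line in handle:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            src = Path(entry)
            # Assumption: the parent directory name is sufficient to keep
            # file names unique under dest_root.
            dest = dest_root / src.parent.name / src.name
            if not dry_run:
                if not src.exists():
                    continue  # already moved or never present
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(src), str(dest))
            pairs.append((src, dest))
    return pairs
```

A call such as `move_aside("unneeded_y4_0000_0643.txt", "y4-outside", dry_run=True)` returns the planned moves without touching any file; the destination name here is only a placeholder.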

jchiang87 commented 4 years ago

To support the ingest at CC-IN2P3, I've made lists of the files that need to be processed:

(lsst-scipipe-1172c30) [in2p3] wc needed*.txt
    75854     75854   6751006 needed_y4_0000_0643.txt
    79161     79161   7045329 needed_y4_0643_1286.txt
    79755     79755   7098195 needed_y4_1286_1929.txt
    76231     76231   6784559 needed_y4_1929_2572.txt
    77046     77046   6857094 needed_y4_2572_3215.txt
    74543     74543   6634327 needed_y4_3215_3858.txt
    68779     68779   6121331 needed_y4_3858_4501.txt
    75708     75708   6738012 needed_y4_4501_5144.txt
    72415     72415   6444935 needed_y4_5144_5787.txt
    63628     63628   5662892 needed_y4_5787_6430.txt
      483       483     42987 needed_y4_6430_6435.txt
    77781     77781   6936499 needed_y5_0000_0744.txt
    91822     91822   8263980 needed_y5_0744_1488.txt
    92474     92474   8322660 needed_y5_1488_2232.txt
    89891     89891   8090190 needed_y5_2232_2976.txt
    84128     84128   7571520 needed_y5_2976_3720.txt
    89733     89733   8075970 needed_y5_3720_4464.txt
    89879     89879   8089110 needed_y5_4464_5208.txt
    80536     80536   7248240 needed_y5_5208_5952.txt
    84091     84091   7568190 needed_y5_5952_6696.txt
    79836     79836   7185240 needed_y5_6696_7439.txt
  1603774   1603774 143532266 total

@johannct

johannct commented 4 years ago

I concatenated your files separately for each year. I have 743603 entries for y4 and 860171 for y5.
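These per-year counts can be cross-checked against the statistics posted earlier in the thread: the number of needed files should equal the expected sensor-visits minus the missing ones. A small sketch of that arithmetic, using only numbers quoted in this issue (the function name is hypothetical):

```python
# Numbers copied from the table and the wc output in this thread.
EXPECTED = {"y4": 743906, "y5": 860173}
MISSING = {"y4": 303, "y5": 2}
NEEDED = {"y4": 743603, "y5": 860171}
NEEDED_TOTAL = 1603774  # "total" line of `wc needed*.txt`


def check_needed_counts():
    """Return {year: True/False} for expected - missing == needed."""
    return {
        year: EXPECTED[year] - MISSING[year] == NEEDED[year]
        for year in EXPECTED
    }


print(check_needed_counts())
print(sum(NEEDED.values()) == NEEDED_TOTAL)
```

Both years and the grand total agree, so the needed lists contain exactly the expected sensor-visits that were actually simulated.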

heather999 commented 4 years ago

Do we have a plan for how to deal with the unneeded files? I don't usually want to completely delete files, but in this case I'm certainly open to it. For now, my intent at NERSC is to move them aside into a separate y*-outsideDC2region directory, store them on NERSC HPSS, and delete them on NERSC CFS. For the SQLite tracking DB, I'm assuming these unneeded files should be ignored. Agreed?

jchiang87 commented 4 years ago

If there is space at CC-IN2P3, I would like to keep them around for a little while (maybe a couple of weeks) since there are some things I'd like to investigate with them. For files at NERSC, I agree that we should archive them on HPSS in a separate area and delete them from CFS. I'm not sure how we use the tracking db once files have been ingested, so I'm fine with either omitting them or keeping them. If we could add a column indicating they are "extra", that might be the most conservative way to proceed.
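As an illustration of the "extra" column idea, here is a minimal sketch using Python's sqlite3 module. The real schema of the tracking db is not shown in this thread, so the table name `sensor_visits` and its `path` column are purely hypothetical; only the general approach (add a flag column, then set it for the listed files) is the point.

```python
import sqlite3


def flag_extra_files(conn, unneeded_paths, table="sensor_visits"):
    """Add an `extra` flag column if absent and set it for the given paths.

    Assumes a hypothetical table with a `path` column; returns the number
    of rows currently flagged as extra.
    """
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if "extra" not in columns:
        conn.execute(
            f"ALTER TABLE {table} ADD COLUMN extra INTEGER NOT NULL DEFAULT 0"
        )
    conn.executemany(
        f"UPDATE {table} SET extra = 1 WHERE path = ?",
        [(path,) for path in unneeded_paths],
    )
    conn.commit()
    query = f"SELECT COUNT(*) FROM {table} WHERE extra = 1"
    return conn.execute(query).fetchone()[0]


# Demonstration on an in-memory database with made-up paths.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_visits (path TEXT PRIMARY KEY, year TEXT)")
conn.executemany(
    "INSERT INTO sensor_visits VALUES (?, ?)",
    [("a.fits", "y4"), ("b.fits", "y4"), ("c.fits", "y5")],
)
print(flag_extra_files(conn, ["b.fits"]))
```

With a flag like this, a transfer tool could skip flagged rows with a `WHERE extra = 0` filter, while the record that the files existed is preserved.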

heather999 commented 4 years ago

The tracking DB keeps track of simulated files: their location, year, region, run (2.2i, 2.1..), and whether we've checked that the FITS files are all valid (they satisfy the FITS standard and were properly closed). The DB has no function as far as DRP, ingest, etc. are concerned. Fabio has been making use of it as part of the data transfers of sims files. This was a larger concern when we had instances of improper FITS files, so that the data transfer tool could easily skip sensor-visits that were "bad". @villarrealas what do you think of including those extra files and marking them "extra"? My naive thought is to keep them out of the tracking DB to avoid future confusion, as these files should never have existed in the first place.

jchiang87 commented 4 years ago

As a sanity check, I made depth maps using the y4 and y5 lists of sensor-visits inside and outside of the DC2 region. Here are the y4 maps:

[Figure: Run2.2i_y04_log10_depths_needed]
[Figure: Run2.2i_y04_log10_depths_unneeded]

and the y5 maps:

[Figure: Run2.2i_y05_log10_depths_needed]
[Figure: Run2.2i_y05_log10_depths_unneeded]