DUNE / data-mgmt-ops


Aaron found some files with same run and event numbers in the hitreco files for the atmnu production #626

Closed StevenCTimm closed 4 months ago

StevenCTimm commented 5 months ago

This production was done at NERSC, so it would have been possible for a job to fail and still leave a partial file, then be retried and produce a different, bigger file, both of which could have been transferred to Fermilab and declared to SAM. The run and event numbers are the same but the random number seed is not, so they are different particles. We just have to figure out for sure how this happened.
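A minimal sketch of how overlapping files could be spotted, assuming the (run, event) pairs of each file have already been extracted into a plain dictionary. The function name `find_shared_events`, the input layout, and the file names are hypothetical and are not part of SAM or any DUNE tool.

```python
from collections import defaultdict


def find_shared_events(files):
    """Return the (run, event) pairs that appear in more than one file.

    `files` is a hypothetical dict: file name -> iterable of (run, event) tuples.
    The result maps each shared (run, event) pair to the sorted file names holding it.
    """
    owners = defaultdict(set)
    for name, run_events in files.items():
        for key in run_events:
            owners[key].add(name)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


# Made-up example: a partial file from an aborted job and the complete file from its retry.
example_files = {
    "partial.root": [(1, 1), (1, 2)],
    "complete.root": [(1, 1), (1, 2), (1, 3)],
    "other.root": [(2, 1)],
}
shared = find_shared_events(example_files)
```

In this made-up input, the pairs (1, 1) and (1, 2) are reported as shared between `partial.root` and `complete.root`, while `other.root` is not flagged.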

StevenCTimm commented 5 months ago

Link to Aaron's slides: https://docs.google.com/presentation/d/1nuRmmaueyzgdRm1kvEYupGwKmkm7rjGZ8MKQAckV15Q/edit?usp=sharing

StevenCTimm commented 5 months ago

My recommendation is that he detach the smaller file of the two, which will always have an earlier timestamp. The origin of these files is that the job ran once at NERSC and aborted, but still produced a file because we were writing to a local data store there. It then ran again, with the same cluster and process ID, and ran to completion. The process that declared all these files to SAM scooped up every file in the output area without regard to completion code. Will hold this open until Aaron does the work (or delegates it to someone in DM).
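The selection rule above can be sketched as follows, assuming each duplicate pair is described by a file name, a size, and a timestamp. The `OutputFile` type and `pick_file_to_detach` function are hypothetical illustrations, not SAM calls; the sketch only chooses which file to detach and refuses to choose when size and timestamp disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputFile:
    """Hypothetical record for one declared output file."""
    name: str
    size_bytes: int
    mtime: float  # creation time, seconds since the epoch


def pick_file_to_detach(a, b):
    """Return the smaller of two duplicate files, expected to be the partial one.

    Raises ValueError if the sizes are equal or if the smaller file is not
    also the earlier one, since the rule above would then not apply cleanly.
    """
    smaller, larger = sorted((a, b), key=lambda f: f.size_bytes)
    if smaller.size_bytes == larger.size_bytes:
        raise ValueError("files have equal size; cannot tell which is partial")
    if smaller.mtime >= larger.mtime:
        raise ValueError("smaller file is not the earlier one; inspect by hand")
    return smaller


# Made-up example values.
first_try = OutputFile("attempt1.root", size_bytes=1_000, mtime=100.0)
retry = OutputFile("attempt2.root", size_bytes=5_000, mtime=200.0)
to_detach = pick_file_to_detach(first_try, retry)
```

Raising an error rather than guessing keeps any pair that breaks the expected pattern out of an automated cleanup, so it can be checked manually.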

StevenCTimm commented 4 months ago

Tracking in #659