LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0

Check for corrupted / zombie files in ldmx-sw? #1406

Closed: tvami closed this issue 1 week ago

tvami commented 3 weeks ago

Is your feature request related to a problem? Please describe.

This issue, https://github.com/LDMX-Software/LDCS/issues/23, made me wonder whether we should have a built-in check for input files being corrupted / zombie.

Describe the solution you'd like

I think https://github.com/LDMX-Software/ldmx-sw/blob/trunk/Framework/src/Framework/Process.cxx#L279 is where this should go, but I don't know enough about the EventFile class; a quick look suggests that it's essentially a TFile wrapper. That makes me wonder whether we could do

if (!inFile || inFile->IsZombie()) continue;
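For illustration, here is a minimal standalone version of that check; the helper name and its direct use of TFile::Open are my own invention, since the real loop in Process.cxx goes through EventFile rather than a raw TFile:

```cpp
#include <memory>
#include <string>

#include "TFile.h"

// Sketch only: true if 'fileName' opens cleanly and is not a zombie,
// i.e. ROOT was able to read a valid file header.
bool isUsable(const std::string& fileName) {
  std::unique_ptr<TFile> f{TFile::Open(fileName.c_str(), "READ")};
  return f && !f->IsZombie();
}
```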
tomeichlersmith commented 2 weeks ago

EventFile does not inherit from TFile, but implementing a quick isCorrupted() function for it would make sense.

We actually already check for existence of both the file and the event tree in the EventFile constructor.

https://github.com/LDMX-Software/ldmx-sw/blob/d74f8ccc86524fb3de6fbc89ef70033ee6c61b73/Framework/src/Framework/EventFile.cxx#L53-L69

Perhaps we should include a check for IsZombie in there as well?
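To make the isCorrupted() idea concrete, here is a rough standalone sketch; the class shape and member names are assumptions for illustration, not the real EventFile:

```cpp
#include <string>

#include "TFile.h"

// Sketch only: not the real EventFile, just the shape of the proposed check.
class EventFileSketch {
 public:
  explicit EventFileSketch(const std::string& fileName)
      : file_{TFile::Open(fileName.c_str(), "READ")} {}

  // quick corruption test to sit alongside the existing file/tree checks
  bool isCorrupted() const { return file_ == nullptr || file_->IsZombie(); }

 private:
  TFile* file_{nullptr};
};
```

Whether the constructor should raise on a corrupted file or leave that decision to the caller is exactly the design question discussed below.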

Personally, I really don't like the idea of continuing smoothly past a corrupted file. I think the Framework should end the program if a corrupted file is encountered in order to inform the user of this corruption. Maybe we can get around the failing-on-the-last-file annoyance by making sure to Write the output file at the end of every input file?

tvami commented 2 weeks ago

I really don't like the idea of continuing smoothly past a corrupted file.

Well, for the use case where I ran into this, that would not be great. I had to ask LK to delete the input file, and until they did, I couldn't move on. (My use case was making a BDT skim, where I have 50 input files per output file.)

So continuing past the bad file, and reporting at the end of the workflow that it was bad, would be my preferred solution.

tomeichlersmith commented 2 weeks ago

This is showing my own ignorance around Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

I'm just speaking from my own (non-Rucio, non-LDCS) experience where I make a file list and just remove the bad files from the list as I encounter them (if I can't personally delete them).

In any case, I'm comfortable with incorporating this skipping behavior, but I would like it to be designed with new users in mind as well, i.e. making sure the error message is very clear and that we don't accidentally silence the other error messages (no file or no tree). This might be a case where we add another flag to the constructor controlling "strictness" :thinking:

tvami commented 2 weeks ago

Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

My understanding from @bryngemark is that this cannot be done.

This might be a case where we add another flag to the constructor about "strictness"

Yes, another flag could work: by default we'd exit if the file is problematic, but if the person running this knows about the flag they can let the code skip bad files (the pool of people who use LDCS is pretty limited anyway).
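A sketch of what that default-fail / opt-in-skip behavior could look like; the flag name, the free function, and the summary printout are all hypothetical, and EXCEPTION_RAISE is assumed to be the Framework's usual exception macro:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "TFile.h"
#include "Framework/Exception/Exception.h"  // assumed header for EXCEPTION_RAISE

// Sketch only: hypothetical handling of corrupted inputs in the file loop,
// gated by a "skip_corrupted_files"-style flag that defaults to strict.
void processInputs(const std::vector<std::string>& inputFiles,
                   bool skipCorruptedFiles = false) {
  // same zombie test as the earlier sketch
  auto usable = [](const std::string& fn) {
    std::unique_ptr<TFile> f{TFile::Open(fn.c_str(), "READ")};
    return f && !f->IsZombie();
  };
  std::vector<std::string> skipped;
  for (const auto& fileName : inputFiles) {
    if (!usable(fileName)) {
      if (!skipCorruptedFiles) {
        // default: fail loudly so the corruption cannot go unnoticed
        EXCEPTION_RAISE("FileError",
                        "Input file '" + fileName + "' is corrupted.");
      }
      skipped.push_back(fileName);
      continue;  // opt-in behavior: move on to the next input file
    }
    // ... normal event loop over this file ...
  }
  // report any skipped files at the end of the workflow
  for (const auto& f : skipped)
    std::cerr << "[ Process ] : skipped corrupted input file " << f << "\n";
}
```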

bryngemark commented 2 weeks ago

That's right, there is currently no smooth way out of this with LDCS. What we do is specify an input data set and a number of files per job; aCT then queries the Rucio database for the actual file names and paths and chooses which files a given job will run on. Resubmission of a failed job (not yet fully implemented in a completely useful way) would repeat the same job with the same input files; aCT does not know the reason for the job failure.

A flag sounds like a sensible solution to me, and would be hugely useful when we run harsh skims. For bookkeeping, though, it would be great if the number of successfully opened files could be reported somewhere that the metadata harvesting for Rucio could find it. Otherwise, we'll soon be lost when it comes to EOT calculations.
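If the skipping route is taken, one cheap way to expose that count would be a fixed-format summary at the end of the job log that the harvesting could parse; this is only a sketch, with made-up names and formatting:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Sketch only: a parseable end-of-job summary that downstream bookkeeping
// (e.g. the Rucio metadata harvesting) could pick out of the log.
void reportInputSummary(const std::vector<std::string>& opened,
                        const std::vector<std::string>& skipped) {
  std::cout << "[ Process ] : INPUT SUMMARY opened=" << opened.size()
            << " skipped=" << skipped.size() << "\n";
  for (const auto& f : skipped)
    std::cout << "[ Process ] : INPUT SKIPPED " << f << "\n";
}
```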