cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.32k forks source link

Inconsistent exit codes for fatal root errors and xrootd errors #44878

Open kpedro88 opened 6 months ago

kpedro88 commented 6 months ago

When diagnosing failures in grid jobs, precise exit code meanings are very valuable. Currently, several exit codes are overloaded or used (in my opinion) inconsistently.

One category of errors is "Fatal Root Error". This occurs when there is some kind of file corruption. Corrupt files have to be fixed centrally, so it is very important to be able to isolate these. Here are some examples of the exit codes that can occur with "Fatal Root Error" and related exception messages:

exit code 84:

----- Begin Fatal Exception 30-Apr-2024 04:07:50 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootFileSequenceBase::initTheFile()
   Additional Info:
      [a] Input file root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/MINIAOD/UL2018_MiniAODv2-v3/280000/AB8F5CB8-2B5B-1945-967F-89EDEC3346AD.root could not be opened.
      [b] Fatal Root Error: @SUB=TStorageFactoryFile::Init
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/MINIAOD/UL2018_MiniAODv2-v3/280000/AB8F5CB8-2B5B-1945-967F-89EDEC3346AD.root not a ROOT file

----- End Fatal Exception -------------------------------------------------

exit code 85:

----- Begin Fatal Exception 30-Apr-2024 03:08:20 BST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Reading branch EventAuxiliary
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-25451) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------

exit code 86:

----- Begin Fatal Exception 30-Apr-2024 04:07:56 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 84983, fKeylen = 98, fObjlen = 395391, noutot = 0, nout=0, nin=84885, nbuf=395391

----- End Fatal Exception -------------------------------------------------

Another category of errors are xrootd issues. These are usually transient, though they can indicate that a file is not accessible anywhere on disk. Here are some examples:

exit code 84:

----- Begin Fatal Exception 27-Apr-2024 16:21:06 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootFileSequenceBase::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root'
   Additional Info:
      [a] Input file root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root could not be opened.
      [b] XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [c] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/mc/RunIISummer20UL18MiniAODv2/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/00000/3E80D215-AFE7-8D46-A8D8-56D2E670C280.root?tried=+1213cmsxrootd2.fnal.gov1213xrootd.unl.edu,
      [d] Problematic data server: cms-xrd-global.cern.ch:1094
      [e] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------

exit code 85:

----- Begin Fatal Exception 30-Apr-2024 11:02:48 CST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Calling InputSource::getNextItemType
   [1] Reading branch EventAuxiliary
   [2] Calling XrdFile::readv()
   [3] XrdAdaptor::ClientRequest::HandleResponse() failure while running connection recovery
   [4] Handling XrdAdaptor::RequestManager::requestFailure()
   [5] In XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts()
Exception Message:
XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL17MiniAODv2/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/230000/1D057463-6DF2-5849-B64C-B1CCB513876F.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400)
   Additional Info:
      [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=t2srv0015.cmsaf.mit.edu:1094 (site T2_US_MIT)).
      [b] Original failed source is t2srv0015.cmsaf.mit.edu:1094 (site T2_US_MIT)
      [c] Disabled source: t2srv0015.cmsaf.mit.edu:1094
----- End Fatal Exception -------------------------------------------------

I propose reserving exit code 84 for xrootd file open error (indicating file missing from disk), exit code 85 for xroot file read error (transient), and exit code 86 for all fatal root errors (indicating file corruption). I am open to other proposals to resolve these ambiguities.

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @kpedro88.

@smuzaffar, @sextonkennedy, @antoniovilela, @Dr15Jones, @rappoccio, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 6 months ago

assign core

cmsbuild commented 6 months ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 6 months ago

@kpedro88 What CMSSW version(s) are these?

One "incorrect exit code for FileReadError" was investigated and fixed in https://github.com/cms-sw/cmssw/issues/42179

makortel commented 6 months ago

Whose exit codes these 84, 85, 86 are? (like the exit code from cmsRun reported in the framework job report XML file, or error code from some intermediate WM tool)

I'm asking because these values do not correspond to the codes the framework would report in the framework job report. In https://twiki.cern.ch/twiki/bin/view/CMSPublic/StandardExitCodes I noticed

84 - Some required file not found; check logs for name of missing file. 85 - Job failed to open local and fallback files.

and to me these seem to be reported by some other component than the framework. In order to improve the framework's exit code treatment we'd need to know the framework's exit codes in these situations.

makortel commented 6 months ago

Assuming the 84, 85, 86 would be the process return values seen by the running script, the corresponding framework exit codes could be

kpedro88 commented 6 months ago

This is back in 10_6_X (processing UL samples). It doesn't look like #42179 was backported that far back, but it's good to know about it.

These exit codes are the result of $? in bash immediately after cmsRun exits.

makortel commented 6 months ago

This is back in 10_6_X (processing UL samples). It doesn't look like #42179 was backported that far back, but it's good to know about it.

Reminding myself more of that issue, it seems 10_6_X was not affected (i.e. 10_6_X was reporting 8021 / 85 properly).

makortel commented 6 months ago

The behavior 4 of the 5 cases in the description appears to be as expected. The exception being the 3rd case, that we'd like to be reported as FileReadError such as in the 2nd case.

We could think of adding a new error code FileNotFound to distinguish the case where e.g. file system or xrootd tells us they can't find the file, from cases where the file exists but something fails during the opening (such as in case 1 and 4). On the other hand, this "something fails" contains very many possible failure modes that we don't believe to have enough information on to reliably distinguish from each other, so I'm not sure if separating FileNotFound would be really useful or not. I'm concerned we would end up reporting some xrootd file open errors as FileNotFound while the real reason was something else.

The error in case 5 is a bit complicated. It occurs at a time when the framework expects the Source to be reading data from a file whose open succeeded, the xrootd connection gets bad, and AAA (XrdAdaptor) decides re-try to open the file by asking the redirector to use a different server, and that file open fails because of file not found. So from framework perspective, the FileReadError is the correct error type. On the other hand, FileNotFound, would be the underlying cause. So would we need FileNotFoundWhileReadingFile or something similar?

At the framework level we can't really do a detailed analysis of the various causes of the errors. Any adjustment of the present error causes to categories (except for the case 3) would also need to communicated and agreed with WM and CRAB.

kpedro88 commented 6 months ago

Clearly there is some ability to distinguish between these cases within the software: the exception messages for cases 1 and 2 contain the phrase "Fatal Root Error", while the exception messages for cases 4 and 5 do not.

What is the benefit of reporting these exceptions, which have vastly different causes, with the same code?

Alternatively, what is the cost of correctly classifying these exceptions with different codes, using the information already available?

The existing FatalRootError code is clearly not being used for all cases that include "Fatal Root Error" in the exception message, which seems straightforwardly inconsistent.

Regarding case 5, we have a FallbackFileOpenError already; maybe this warrants a FallbackFileReadError?

makortel commented 6 months ago

What is the benefit of reporting these exceptions, which have vastly different causes, with the same code?

Figuring out the causes reliably is difficult, especially when all we get is a string whose stability across external software updates is not guaranteed.

The existing FatalRootError code is clearly not being used for all cases that include "Fatal Root Error" in the exception message, which seems straightforwardly inconsistent.

The FatalRootError is a catch-all for many things ROOT reports via Error() (etc). In some cases, like when opening a file, there is higher-level context that overrides the exception category. Like in case 1, opening a file failed because of a fatal ROOT error. The FatalRootError just by itself doesn't indicate anything else than ROOT issued an error (could be in IO, could be in histograms, could be in Minuit, etc).

Regarding case 5, we have a FallbackFileOpenError already; maybe this warrants a FallbackFileReadError?

Presently the places that call reads do not know about the file being a "fallback file". What would be the benefit of knowing whether the FileReadError came from a file that was opened via the "primary catalog" vs some of the "fallback catalogs"?

makortel commented 6 months ago

One category of errors is "Fatal Root Error". This occurs when there is some kind of file corruption.

Coming back to this point, as I mentioned above, the FatalRootError exception category can have many different causes. Even if in most IO-related cases it suggests towards a file corruption, at the framework level we can not make such a judgement reliable. In fact, on any read error, we can not distinguish the actual causes of the bad bits, e.g. among "file is corrupt", "there was a network error", "there was a memory error", "there was a data race". Therefore, the best we can generally do (without at least a huge development investment), is to report "we encountered an error while reading, and here is all we know about it".

kpedro88 commented 6 months ago

Yes, bad bits could be caused by a network error. But often it is caused by actual file corruption at the source. Erring on the side of highlighting possible file corruption seems like a better balance to me than grouping this in with obvious network errors that aren't even coming from ROOT.

(My suggestion for case 5 was just based on your description of why it's complicated. My main goal here is to differentiate "file corrupted" from "transient network issue", which seems achievable at some level of accuracy already.)

makortel commented 6 months ago

For the errors originating from the xrootd layer we could think of adding RemoteReadError exit code.

belforte commented 6 months ago

Kevin pointed me to this. FWIW I have seen tracking input file read errors for a while now and have seen all sort of errors to be either persistent (actuall bad file) or transient (error in the network an/or the storage server). It does not help that in many cases (truncated file e.g.) xrootd does not say where it was trying to read from. A RemoteReadError will help in order to tell where we have more problems, but actions to take will be the same: automatic retries followed by automatic check of all replicas and retrasfer in Rucio.

makortel commented 6 months ago

Thanks @belforte for the feedback.

Just to summarize, the present proposal would be to

In addition, just to make it clear (from @kpedro88's request), the FatalRootError exit code should not get used for read errors, and we will (eventually) fix those to be reported as FileReadErrors.

While discussing with @kpedro88 I became to wonder the fate of FallbackFileOpenError. To remind, here the "fallback" refers to the additional file catalogs defined in site-local-config.xml after the "primary catalog". For such "fallback files", is it useful to know

belforte commented 6 months ago

wow ! Can you really differentiate "not found" from "file open error" ? And are/will you separate read errors of the "can't read basket" kind from "branch not found" kind?

I think we should keep this simple. I was about to propose a detailed classification scheme but.. in the end what matters is to identify problem enough for being able to decide which corrective action to take. When we end up with "verify checksum and possibly re-transfer", it matters little if problem was in open or read in the middle of the file. What would be great is to know from which RSE file was being read. At least if the site is local or remote. My favorite example is truncated file, xrootd does not say where it reads from. Only the global/regional redirector name is printed https://github.com/dmwm/CRABServer/issues/7548#issuecomment-1713880031

And reliable and easy-to-find information about the name of the file being read. Now my code in CRAB "scrolls the log up" looking for last "open", ugly and fragile.

That file was not found in primary catalog entry is not so important, there are many cases where jobs run at a site which does not have the input data local. We have outgrown considerably the initial "xrootd as fallback" and I do not see a turning back. And think about disk-less sites (we have some T3 like that already).

makortel commented 6 months ago

wow ! Can you really differentiate "not found" from "file open error" ?

At least in some cases we should be able to. E.g. in jobs that use our own TFileAdaptor layer (which is the case by default), where we handle raw file IO instead of root we should be able to distinguish "not found" from other (presumably more often higher-level) "file open" errors. At least for local files it should be straightforward, and for xrootd this error with code 3011 "No servers are available to read the file" looks like a good candidate to be classified as "file not found".

I'm not sure though how reliably we could do that for all the protocols we register a handler in TFileAdaptor. I'd expect many protocols to distinguish "not found" from other errors, but I also haven't taken a deep look. (then one can start asking questions like if we can't do this disambiguation reliably, should we try to do it at all? and do we still need all these protocols for which the handlers were developed 10+ years ago? etc)

And are/will you separate read errors of the "can't read basket" kind from "branch not found" kind?

This should already to large degree be the case. The "branch not found" should be

I think we should keep this simple.

I fully agree.

What would be great is to know from which RSE file was being read. At least if the site is local or remote. My favorite example is truncated file, xrootd does not say where it reads from. Only the global/regional redirector name is printed dmwm/CRABServer#7548 (comment)

This error message looks like something we might be able to improve (although it's also tricky, the layers above the xrootd do not know anything about the "actual file name", i.e. the PFN is the file name as far as they are concerned). Does that file still exist? (on the other hand it should be fairly easy to create a new truncated ROOT file)

And reliable and easy-to-find information about the name of the file being read. Now my code in CRAB "scrolls the log up" looking for last "open", ugly and fragile.

I think we should have something in the framework job report too, but that might be limited to LFN or PFN. In this case, are you after the PFN, or the "actual URL xrootd was reading from"?

That file was not found in primary catalog entry is not so important, there are many cases where jobs run at a site which does not have the input data local.

Ok, thanks. So if we would change the xrootd-related errors to be reported via a new RemoteReadError exit code, could we then stop issuing FallbackFileOpenError (8028) altogether?

belforte commented 6 months ago

indeed one can argue forever about "handling many/most while possibly obfuscating few" and get into a huge todo list with no major benefits. Let's stick to clear action items.

So if we would change the xrootd-related errors to be reported via a new RemoteReadError exit code, could we then stop issuing FallbackFileOpenError (8028) altogether?

IIUC the new error will cover both file open via xrootd adaptor failed and and 'read error happened after file was successfully opened'. Can you tell a RemoteRead from a local read on a site whjich uses xroot as the protocol to serve local data to WN's (EOS e.g.) ? I know that I sound like I am changing my mind here... it is a confusing topic, thanks for clarifying.

As to "which file gave this error", anything which contains a LFN is good. The actual xrootd URL with remove site indication would be even better, but may not make a dramatic difference.

Unrelated, quite possibly some protocols can be deprecated now. Facilities and SiteSupport (aka Stephan Lammel) is a good candidate to review the list.

P.S. my truncated file is still there. FWIW I simply used truncate command in bash shell. Anyhow truncated files were mostly a "Florida thing" which has largely disappeared by now :man_shrugging:

makortel commented 1 month ago

Can you tell a RemoteRead from a local read on a site whjich uses xroot as the protocol to serve local data to WN's (EOS e.g.) ?

No, we would treat any read errors through xrootd as RemoteReadError (and only when XrdAdaptor is being used, this is the default but it is technically possible for a user to turn it off... but maybe we don't need consider such cases)