Open kpedro88 opened 6 months ago
cms-bot internal usage
A new Issue was created by @kpedro88.
@smuzaffar, @sextonkennedy, @antoniovilela, @Dr15Jones, @rappoccio, @makortel can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
@kpedro88 What CMSSW version(s) are these?
One "incorrect exit code for FileReadError
" was investigated and fixed in https://github.com/cms-sw/cmssw/issues/42179
Whose exit codes these 84, 85, 86 are? (like the exit code from cmsRun reported in the framework job report XML file, or error code from some intermediate WM tool)
I'm asking because these values do not correspond to the codes the framework would report in the framework job report. In https://twiki.cern.ch/twiki/bin/view/CMSPublic/StandardExitCodes I noticed
84 - Some required file not found; check logs for name of missing file. 85 - Job failed to open local and fallback files.
and to me these seem to be reported by some other component than the framework. In order to improve the framework's exit code treatment we'd need to know the framework's exit codes in these situations.
Assuming the 84, 85, 86 would be the process return values seen by the running script, the corresponding framework exit codes could be
84 = 8020 % 256
: FileOpenError85 = 8021 % 256
: FileReadError86 = 8022 % 256
: FatalRootErrorThis is back in 10_6_X (processing UL samples). It doesn't look like #42179 was backported that far back, but it's good to know about it.
These exit codes are the result of $?
in bash
immediately after cmsRun
exits.
This is back in 10_6_X (processing UL samples). It doesn't look like #42179 was backported that far back, but it's good to know about it.
Reminding myself more of that issue, it seems 10_6_X was not affected (i.e. 10_6_X was reporting 8021 / 85 properly).
The behavior 4 of the 5 cases in the description appears to be as expected. The exception being the 3rd case, that we'd like to be reported as FileReadError
such as in the 2nd case.
We could think of adding a new error code FileNotFound
to distinguish the case where e.g. file system or xrootd tells us they can't find the file, from cases where the file exists but something fails during the opening (such as in case 1 and 4). On the other hand, this "something fails" contains very many possible failure modes that we don't believe to have enough information on to reliably distinguish from each other, so I'm not sure if separating FileNotFound
would be really useful or not. I'm concerned we would end up reporting some xrootd file open errors as FileNotFound
while the real reason was something else.
The error in case 5 is a bit complicated. It occurs at a time when the framework expects the Source to be reading data from a file whose open succeeded, the xrootd connection gets bad, and AAA (XrdAdaptor
) decides re-try to open the file by asking the redirector to use a different server, and that file open fails because of file not found. So from framework perspective, the FileReadError
is the correct error type. On the other hand, FileNotFound
, would be the underlying cause. So would we need FileNotFoundWhileReadingFile
or something similar?
At the framework level we can't really do a detailed analysis of the various causes of the errors. Any adjustment of the present error causes to categories (except for the case 3) would also need to communicated and agreed with WM and CRAB.
Clearly there is some ability to distinguish between these cases within the software: the exception messages for cases 1 and 2 contain the phrase "Fatal Root Error", while the exception messages for cases 4 and 5 do not.
What is the benefit of reporting these exceptions, which have vastly different causes, with the same code?
Alternatively, what is the cost of correctly classifying these exceptions with different codes, using the information already available?
The existing FatalRootError
code is clearly not being used for all cases that include "Fatal Root Error" in the exception message, which seems straightforwardly inconsistent.
Regarding case 5, we have a FallbackFileOpenError
already; maybe this warrants a FallbackFileReadError
?
What is the benefit of reporting these exceptions, which have vastly different causes, with the same code?
Figuring out the causes reliably is difficult, especially when all we get is a string whose stability across external software updates is not guaranteed.
The existing
FatalRootError
code is clearly not being used for all cases that include "Fatal Root Error" in the exception message, which seems straightforwardly inconsistent.
The FatalRootError
is a catch-all for many things ROOT reports via Error()
(etc). In some cases, like when opening a file, there is higher-level context that overrides the exception category. Like in case 1, opening a file failed because of a fatal ROOT error. The FatalRootError
just by itself doesn't indicate anything else than ROOT issued an error (could be in IO, could be in histograms, could be in Minuit, etc).
Regarding case 5, we have a
FallbackFileOpenError
already; maybe this warrants aFallbackFileReadError
?
Presently the places that call reads do not know about the file being a "fallback file". What would be the benefit of
knowing whether the FileReadError
came from a file that was opened via the "primary catalog" vs some of the "fallback catalogs"?
One category of errors is "Fatal Root Error". This occurs when there is some kind of file corruption.
Coming back to this point, as I mentioned above, the FatalRootError
exception category can have many different causes. Even if in most IO-related cases it suggests towards a file corruption, at the framework level we can not make such a judgement reliable. In fact, on any read error, we can not distinguish the actual causes of the bad bits, e.g. among "file is corrupt", "there was a network error", "there was a memory error", "there was a data race". Therefore, the best we can generally do (without at least a huge development investment), is to report "we encountered an error while reading, and here is all we know about it".
Yes, bad bits could be caused by a network error. But often it is caused by actual file corruption at the source. Erring on the side of highlighting possible file corruption seems like a better balance to me than grouping this in with obvious network errors that aren't even coming from ROOT.
(My suggestion for case 5 was just based on your description of why it's complicated. My main goal here is to differentiate "file corrupted" from "transient network issue", which seems achievable at some level of accuracy already.)
For the errors originating from the xrootd layer we could think of adding RemoteReadError
exit code.
Kevin pointed me to this. FWIW I have seen tracking input file read errors for a while now and have seen all sort of errors to be either persistent (actuall bad file) or transient (error in the network an/or the storage server). It does not help that in many cases (truncated file e.g.) xrootd does not say where it was trying to read from. A RemoteReadError
will help in order to tell where we have more problems, but actions to take will be the same: automatic retries followed by automatic check of all replicas and retrasfer in Rucio.
Thanks @belforte for the feedback.
Just to summarize, the present proposal would be to
FileOpenError
into a new FileNotFound
exit codeFileReadError
into a new RemoteReadError
exit codeIn addition, just to make it clear (from @kpedro88's request), the FatalRootError
exit code should not get used for read errors, and we will (eventually) fix those to be reported as FileReadError
s.
While discussing with @kpedro88 I became to wonder the fate of FallbackFileOpenError
. To remind, here the "fallback" refers to the additional file catalogs defined in site-local-config.xml
after the "primary catalog". For such "fallback files", is it useful to know
FileNotFound
FileNotFound
wow ! Can you really differentiate "not found" from "file open error" ? And are/will you separate read errors of the "can't read basket" kind from "branch not found" kind?
I think we should keep this simple. I was about to propose a detailed classification scheme but.. in the end what matters is to identify problem enough for being able to decide which corrective action to take. When we end up with "verify checksum and possibly re-transfer", it matters little if problem was in open or read in the middle of the file. What would be great is to know from which RSE file was being read. At least if the site is local or remote. My favorite example is truncated file, xrootd does not say where it reads from. Only the global/regional redirector name is printed https://github.com/dmwm/CRABServer/issues/7548#issuecomment-1713880031
And reliable and easy-to-find information about the name of the file being read. Now my code in CRAB "scrolls the log up" looking for last "open", ugly and fragile.
That file was not found in primary catalog entry is not so important, there are many cases where jobs run at a site which does not have the input data local. We have outgrown considerably the initial "xrootd as fallback" and I do not see a turning back. And think about disk-less sites (we have some T3 like that already).
wow ! Can you really differentiate "not found" from "file open error" ?
At least in some cases we should be able to. E.g. in jobs that use our own TFileAdaptor
layer (which is the case by default), where we handle raw file IO instead of root we should be able to distinguish "not found" from other (presumably more often higher-level) "file open" errors. At least for local files it should be straightforward, and for xrootd this error with code 3011 "No servers are available to read the file" looks like a good candidate to be classified as "file not found".
I'm not sure though how reliably we could do that for all the protocols we register a handler in TFileAdaptor
. I'd expect many protocols to distinguish "not found" from other errors, but I also haven't taken a deep look. (then one can start asking questions like if we can't do this disambiguation reliably, should we try to do it at all? and do we still need all these protocols for which the handlers were developed 10+ years ago? etc)
And are/will you separate read errors of the "can't read basket" kind from "branch not found" kind?
This should already to large degree be the case. The "branch not found" should be
ProductNotFound
(8006) when the missing piece is some data product in Event/Lumi/RunEventCorruption
(8016) when framework metadata is missing
EventCorruption
does not describe well the situations when it is thrown...I think we should keep this simple.
I fully agree.
What would be great is to know from which RSE file was being read. At least if the site is local or remote. My favorite example is truncated file, xrootd does not say where it reads from. Only the global/regional redirector name is printed dmwm/CRABServer#7548 (comment)
This error message looks like something we might be able to improve (although it's also tricky, the layers above the xrootd do not know anything about the "actual file name", i.e. the PFN is the file name as far as they are concerned). Does that file still exist? (on the other hand it should be fairly easy to create a new truncated ROOT file)
And reliable and easy-to-find information about the name of the file being read. Now my code in CRAB "scrolls the log up" looking for last "open", ugly and fragile.
I think we should have something in the framework job report too, but that might be limited to LFN or PFN. In this case, are you after the PFN, or the "actual URL xrootd was reading from"?
That file was not found in primary catalog entry is not so important, there are many cases where jobs run at a site which does not have the input data local.
Ok, thanks. So if we would change the xrootd-related errors to be reported via a new RemoteReadError
exit code, could we then stop issuing FallbackFileOpenError
(8028) altogether?
indeed one can argue forever about "handling many/most while possibly obfuscating few" and get into a huge todo list with no major benefits. Let's stick to clear action items.
So if we would change the xrootd-related errors to be reported via a new RemoteReadError exit code, could we then stop issuing FallbackFileOpenError (8028) altogether?
IIUC the new error will cover both file open via xrootd adaptor failed
and and 'read error happened after file was successfully opened'. Can you tell a RemoteRead
from a local read on a site whjich uses xroot as the protocol to serve local data to WN's (EOS e.g.) ? I know that I sound like I am changing my mind here... it is a confusing topic, thanks for clarifying.
As to "which file gave this error", anything which contains a LFN is good. The actual xrootd URL with remove site indication would be even better, but may not make a dramatic difference.
Unrelated, quite possibly some protocols can be deprecated now. Facilities and SiteSupport (aka Stephan Lammel) is a good candidate to review the list.
P.S. my truncated file is still there. FWIW I simply used truncate
command in bash shell. Anyhow truncated files were mostly a "Florida thing" which has largely disappeared by now :man_shrugging:
Can you tell a
RemoteRead
from a local read on a site whjich uses xroot as the protocol to serve local data to WN's (EOS e.g.) ?
No, we would treat any read errors through xrootd as RemoteReadError
(and only when XrdAdaptor
is being used, this is the default but it is technically possible for a user to turn it off... but maybe we don't need consider such cases)
When diagnosing failures in grid jobs, precise exit code meanings are very valuable. Currently, several exit codes are overloaded or used (in my opinion) inconsistently.
One category of errors is "Fatal Root Error". This occurs when there is some kind of file corruption. Corrupt files have to be fixed centrally, so it is very important to be able to isolate these. Here are some examples of the exit codes that can occur with "Fatal Root Error" and related exception messages:
exit code 84:
exit code 85:
exit code 86:
Another category of errors are xrootd issues. These are usually transient, though they can indicate that a file is not accessible anywhere on disk. Here are some examples:
exit code 84:
exit code 85:
I propose reserving exit code 84 for xrootd file open error (indicating file missing from disk), exit code 85 for xroot file read error (transient), and exit code 86 for all fatal root errors (indicating file corruption). I am open to other proposals to resolve these ambiguities.