We are having trouble with xrootd, and the solution may be
partially in art. From observations of log files and discussions with
dCache experts, we have three cases:
1. The file is in tape-backed dCache and not on disk at the moment
   of the request. In this case, dCache returns via xrootd a code
   that indicates this state. The dCache experts say a user should wait
   and retry, but we saw that root/art aborts immediately. (ifdh and nfs
   block, so it isn't an issue there.)
2. If a server is overloaded and the request goes into a dCache
   queue, after 30s it will return an error (see below). In this case the
   right thing to do is retry for a while.
3. There are transient errors (we've seen mysterious DNS errors),
   and there should be retries.
In previous discussions with Kyle and Philippe, we had concluded
that root should currently be configured to retry many times
for perhaps an hour. If I understood correctly, this does not happen
because art catches the error and treats all non-info messages as fatal.
As I recall, in the case of a file on tape and not staged, the return
code was special, "resource not available", so it could be recognized
and treated properly: wait and retry. Ideally
xrootd would just block.
The error returned in the overloaded case is essentially "file not found"
(see below), so it is not distinguishable from an actual missing file.
Ideally we can get those separated so we can treat them differently,
and even better, the request would block for longer than 30s.
The transient error case should be handled with at least a few retries.
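To make the proposed handling concrete, here is a minimal sketch of a per-case retry policy. It is not art or root code. The names `OpenError`, `FileOpenFailed`, `RETRY_POLICY`, and `open_with_retry` are hypothetical, and the retry counts and delays are placeholder assumptions rather than agreed values. It also assumes the opener can report which of the three cases occurred, which today is only partly true, since the overloaded case looks like "file not found".

```python
import time
from enum import Enum


class OpenError(Enum):
    """Hypothetical classification of the three cases described above."""
    NOT_STAGED = "resource not available"  # file on tape, not yet on disk
    NOT_FOUND = "file not found"           # overloaded queue OR truly missing
    TRANSIENT = "transient"                # e.g. a DNS failure


# Assumed budgets: (maximum failures tolerated, seconds between attempts).
RETRY_POLICY = {
    OpenError.NOT_STAGED: (60, 60.0),  # keep trying for roughly an hour
    OpenError.NOT_FOUND: (10, 30.0),   # retry for a while, then give up
    OpenError.TRANSIENT: (3, 5.0),     # a few quick retries
}


class FileOpenFailed(Exception):
    """Raised by the (hypothetical) opener with the kind of failure."""

    def __init__(self, kind):
        super().__init__(kind.value)
        self.kind = kind


def open_with_retry(open_file, url, sleep=time.sleep):
    """Call open_file(url), retrying according to the kind of error."""
    failures = {}
    while True:
        try:
            return open_file(url)
        except FileOpenFailed as err:
            max_failures, delay = RETRY_POLICY[err.kind]
            failures[err.kind] = failures.get(err.kind, 0) + 1
            if failures[err.kind] >= max_failures:
                raise  # budget for this kind of error is exhausted
            sleep(delay)
```

Because a real missing file and an overloaded server both map to `NOT_FOUND` here, a missing file would cost the full retry budget before failing, which is one reason to want the two errors separated.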
22-Aug-2019 18:04:35 UTC Initiating request to open input file
"xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/
mu2epro/workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art"
%MSG-s ArtException: PostEndJob 22-Aug-2019 18:05:01 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
---- FileOpenError BEGIN
RootInputFileSequence::initFile(): Input file
xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/mu2epro/
workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art was not found or could not be opened.
This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/23167 (FNAL account required) Originally created by @rlcee on 2019-08-23 21:57:54