art-framework-suite / art-root-io

match art/root behavior to dCache #3

Open knoepfel opened 2 years ago

knoepfel commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/23167 (FNAL account required).
Originally created by @rlcee on 2019-08-23 21:57:54


We are having trouble with xrootd, and the solution may be partially in art. From log files and discussions with dCache experts, we have identified three cases:

  1. The file is in tape-backed dCache and not on disk at the moment of the request. In this case, dCache returns via xrootd a code that indicates this state. They say a user should wait and retry, but we saw that root/art aborts immediately. (ifdh and nfs block, so it isn't an issue there.)
  2. If a server is overloaded and the request goes into a dCache queue, dCache returns an error after 30s (see below). In this case the right thing to do is to retry for a while.
  3. There are transient errors (we've seen mysterious DNS errors), and there should be retries.

In previous discussions with Kyle and Philippe, we had concluded that root should currently be configured to retry many times for perhaps an hour. If I understood correctly, this does not happen because art catches the error and treats all non-info messages as fatal.

As I recall, when the file is on tape and not staged, the return code is special, "resource not available", so it can be recognized and treated properly: wait and retry. Ideally xrootd would just block.

The error returned in the overloaded case is essentially "file not found" (see below), so it is not distinguishable from an actual missing file. Ideally we can get those separated so we can treat them differently; better still, it would block for longer than 30s.

The transient error case should be handled with at least a few retries.
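
To make the three cases concrete, here is a rough sketch of the per-case retry policy described above. It is written in Python for brevity, not as art or root code, and every name in it (`OpenFailure`, `FileOpenError`, `POLICY`, `open_with_retry`) is hypothetical. The retry counts and delays are placeholder guesses at "perhaps an hour", "a while", and "a few retries". It also assumes the caller can already tell the cases apart, which, as noted above, is not currently true for the overloaded case.

```python
import time
from enum import Enum, auto


class OpenFailure(Enum):
    """Hypothetical classification of why an xrootd open failed."""
    NOT_STAGED = auto()  # case 1: file is on tape, not yet on disk
    OVERLOADED = auto()  # case 2: request timed out in a dCache queue
    TRANSIENT = auto()   # case 3: e.g. a short-lived DNS error
    NOT_FOUND = auto()   # the file really is missing


class FileOpenError(Exception):
    """Hypothetical error carrying the failure classification."""

    def __init__(self, kind):
        super().__init__(kind.name)
        self.kind = kind


# Assumed policy: (maximum retries, delay between retries in seconds).
# The numbers are placeholders, not measured or recommended values.
POLICY = {
    OpenFailure.NOT_STAGED: (60, 60.0),  # roughly an hour of waiting
    OpenFailure.OVERLOADED: (20, 30.0),  # "retry for a while"
    OpenFailure.TRANSIENT: (3, 5.0),     # "at least a few retries"
}


def open_with_retry(open_file, url, policy=POLICY, sleep=time.sleep):
    """Call open_file(url), retrying according to the failure kind.

    Failure kinds without a policy entry (here NOT_FOUND) are re-raised
    immediately, so a truly missing file stays fatal.
    """
    retries_used = {}
    while True:
        try:
            return open_file(url)
        except FileOpenError as err:
            max_retries, delay_s = policy.get(err.kind, (0, 0.0))
            used = retries_used.get(err.kind, 0)
            if used >= max_retries:
                raise
            retries_used[err.kind] = used + 1
            sleep(delay_s)
```

Here `open_file` stands in for whatever actually opens the file and raises `FileOpenError`. The `sleep` argument is injectable only so the loop can be exercised without real waiting.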

22-Aug-2019 18:04:35 UTC  Initiating request to open input file 
"xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/
mu2epro/workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art"

%MSG-s ArtException:  PostEndJob 22-Aug-2019 18:05:01 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
---- FileOpenError BEGIN
RootInputFileSequence::initFile(): Input file 
xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/mu2epro/
workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art was not found or could not be opened.

knoepfel commented 2 years ago

Comment by @knoepfel on 2019-09-04 14:41:18


This will take some investigation. How does this relate to Redmine issue 21638?