art-framework-suite / art-root-io


What is art file open retry behavior? #2

Open knoepfel opened 2 years ago

knoepfel commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/21638 (FNAL account required). Originally created by @rlcee on 2019-01-08 16:19:25.


In some recent jobs, I was reading a list of 8 art files as input to an art (v2_11_05) executable. I have 40 jobs running on 8 files each; 38 finished OK and two appear to hang on a file open. I asked dCache what they saw, and they said their logs show a series of quick connects followed by disconnects. I can get more information, or you can join INC000001011024 and ask questions. For now I'd like to ask what I should expect from RootInput in terms of retries, at the art and ROOT layers, and whether you recognize this behavior. I'd also like to know what might be done to log any retries on the art side.

For now, this is just a request to confirm what art behavior is expected. I have to correlate that with dCache behavior and figure out what is failing. Eventually, we might ask for additional retry behavior (dCache predicted a retry would work in this case). When copying files to disk, we can do retries. However, FermiGrid disk contention, which will not be fixed anytime soon, is forcing us to use xroot streaming file access more, so streaming access must be highly reliable.

knoepfel commented 2 years ago

Comment by @knoepfel on 2019-01-08 17:33:39


I do not believe you are seeing anything art-specific. I will talk with Philippe to figure out what the expected behavior from ROOT is.

knoepfel commented 2 years ago

Comment by @knoepfel on 2019-01-09 22:27:08


The system.rootrc file included with the ROOT distributions we provide includes the following parameters:


# NetXNG.ConnectionWindow     - A time window for the connection establishment. A
#                               connection failure is declared if the connection
#                               is not established within the time window. If a
#                               connection failure happens earlier then another
#                               connection attempt will only be made at the
#                               beginning of the next window.
NetXNG.ConnectionWindow: 30

# NetXNG.ConnectionRetry      - Number of connection attempts that should be
#                               made (number of available connection windows)
#                               before declaring a permanent failure.
NetXNG.ConnectionRetry: 4096

# NetXNG.RequestTimeout       - Default value for the time after which an error
#                               is declared if it was impossible to get a
#                               response to a request.
NetXNG.RequestTimeout: 14400

# NetXNG.RedirectLimit        - Maximum number of allowed redirections.
NetXNG.RedirectLimit: 64

Unless the user provides a .rootrc file that overrides these defaults, or the XRD_* environment variables are set, ROOT uses these settings. Up to 4096 connection attempts are allowed, one per 30-second connection window, and a request that gets no response is declared failed after 4 hours (14,400 seconds). The 4-hour timeout may be much too long for your case, in which case it should be overridden. This can be done by setting the XRD_REQUESTTIMEOUT environment variable to the desired number of seconds.
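
To make the numbers concrete, here is a minimal Python sketch of how a job wrapper might build the override environment and estimate the worst-case time spent in connection attempts. It is only an illustration: the helper names and the override values (5 retries, 600 seconds) are hypothetical. It also assumes that each NetXNG setting maps to an upper-cased XRD_* variable in the same way that RequestTimeout maps to XRD_REQUESTTIMEOUT, and that at most one connection attempt is made per window, as the rootrc comments describe.

```python
import os

# Defaults quoted from the system.rootrc excerpt in this comment.
ROOTRC_DEFAULTS = {
    "ConnectionWindow": 30,    # seconds per connection window
    "ConnectionRetry": 4096,   # connection windows before permanent failure
    "RequestTimeout": 14400,   # seconds to wait for a response to a request
    "RedirectLimit": 64,       # maximum number of redirections
}


def worst_case_connect_seconds(settings):
    """Upper bound on connection time, assuming one attempt per window."""
    return settings["ConnectionWindow"] * settings["ConnectionRetry"]


def xrd_environment(overrides, base_env=None):
    """Return a copy of the environment with XRD_* overrides applied.

    Assumption: NetXNG.<Name> corresponds to XRD_<NAME>, following the
    XRD_REQUESTTIMEOUT example mentioned in the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in overrides.items():
        if name not in ROOTRC_DEFAULTS:
            raise KeyError(f"unknown NetXNG setting: {name}")
        env["XRD_" + name.upper()] = str(value)
    return env


# Hypothetical tighter limits for a grid job; the values are illustrative only.
overrides = {"ConnectionRetry": 5, "RequestTimeout": 600}
effective = {**ROOTRC_DEFAULTS, **overrides}
env = xrd_environment(overrides)

print("default worst-case connect time (s):", worst_case_connect_seconds(ROOTRC_DEFAULTS))
print("override worst-case connect time (s):", worst_case_connect_seconds(effective))
```

The resulting `env` dictionary could then be passed as the `env` argument of `subprocess.run` when launching the job. Under the stated assumptions, the defaults allow up to 4096 × 30 = 122,880 seconds (roughly 34 hours) of connection attempts before a permanent failure is declared.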

Chris Green (on the watchers list) recalls that the bulk of the XRootD errors he encountered were due to authentication timeouts, and that the timeout window for authentication was not configurable. We do not know whether this is still the case. Chris, please correct me where necessary.

knoepfel commented 2 years ago

Comment by @knoepfel on 2019-01-14 16:27:54


Ray, is the information above sufficient for you to proceed?

knoepfel commented 2 years ago

Comment by @rlcee on 2019-01-14 17:10:14


Thanks, that's very useful information. Do you know if the failures before a connect can be logged? I'd like to log every failed attempt, including the error. Then I think that's all I need out of this ticket.