Open bockjoo opened 2 months ago
cms-bot internal usage
A new Issue was created by @bockjoo.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
This is a reproducer CMSSW config:
import FWCore.ParameterSet.Config as cms
process = cms.Process('NoSplit')
process.source = cms.Source("PoolSource",
fileNames = cms.untracked.vstring(
# ERROR Generation
#'/store/mc/RunIIFall17GS/MinBias_TuneCP5_inelasticON_13TeV-pythia8/GEN-SIM/93X_mc2017_realistic_v3-v1/30000/5C4FD8FA-71C1-E711-8D54-001E67247CC9.root',
# SHOULD work
#'/store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/2550003/86acbeb4-c085-4ebd-87ee-25d1ec0643bf.root',
# SHOULD work
#'root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/2550003/86acbeb4-c085-4ebd-87ee-25d1ec0643bf.root',
# NCHC nchc
'root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root',
)
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(10))
process.options = cms.untracked.PSet(wantSummary = cms.untracked.bool(True))
process.output = cms.OutputModule("PoolOutputModule",
outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
fileName = cms.untracked.string('output.root'),
)
process.out = cms.EndPath(process.output)
Testing it with
CMSSW_13_3_1_patch1
With the above mentioned addition of
m_disabledExcludeStrings.insert(dataServer);
, the new_filename will look like:
new_filename root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutri no_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root ?tried=,se01.grid.nchc.org.tw
For-loop
for (int idx = 0; idx < retries; idx++) { }
as is does not work for this file.
The file is at T2_TW_NCHC and T2_CH_CERN:
xrdfs xrootd-cms.infn.it:1094 locate -m /store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root
se01.grid.NCHC.org.tw:11001 Server Read
eoscms-ns-ip563.cern.ch:1098 Server ReadWrite
eoscms-ns-ip700.cern.ch:1098 Server ReadWrite
and at T1_US_FNAL_Disk:
xrdfs cmsxrootd.fnal.gov:1094 locate -m /store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root
cmsxrootd-site3.fnal.gov:1094 Manager Read
cmsxrootd-site2.fnal.gov:1094 Manager Read
cmssrv608.fnal.gov:1094 Manager Read
cmsxrootd-site1.fnal.gov:1094 Manager Read
But the for-loop never reaches beyond T2_TW_NCHC site.
Adding
m_disabledExcludeStrings.insert(dataServer);
seems to be the simplest solution in improving the Xrd File Open failure in this case.
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Adding @bbockelm
There seems to be some relation to https://github.com/cms-sw/cmssw/issues/42056
I ran the "reproducer configuration" in 14_2_0_pre1. Here is what happens (this may be mostly for myself)
[Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Got invalid redirection URL: 2001:e10:6040:132:0:0:0:8?org.dcache.xrootd.client=mkortela.1404966@cmsdev42.cern.ch&org.dcache.uuid=25239b47-fc00-40d0-ab3f-8d97f30616fb?org.dcache.door=2001:e10:4000:122:0:0:0:105:1094
DataServer
is se01.grid.nchc.org.tw:1094
LastURL
is root://se01.grid.nchc.org.tw:1094//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root?xrdcl.requuid=9e75143d-d891-4c84-bccd-8f797ac46295
root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root?tried=
Got invalid redirection URL
DataServer
is se01.grid.nchc.org.tw:1094
LastURL
is root://se01.grid.nchc.org.tw:1094//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root?tried=&xrdcl.requuid=cddbaf61-e49b-4e7a-9253-a5c7db8387ef
The retry loop stops because the same DataServer
was returned twice
https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L251-L255
(on the first loop iteration the aforementioned DataServer
was added to m_disabledSourceStrings
).
Some questions
Got invalid redirection URL
mean?excludeString
to m_disabledExcludeStrings
(leading to empty entries in the tried=
argument in the URL) in https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L258I can confirm that adding m_disabledExcludeStrings.insert(dataServer);
to
https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L244-L246
causes the job to succeed. In this case the retry open attempt is for file root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root?tried=,se01.grid.nchc.org.tw
.
However, we don't seem to have a good (or any?) way to test these things, and I'm concerned if this change could have unwanted effects on some other xrootd systems (e.g. CERN EOS, either with or without erasure coding).
The existing code to set the excludeString
(that would end up in the m_disabledExcludeStrings
) is called from
https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L227
and looks like
https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdSource.cc#L290-L304
Underneath the isDCachePool()
ends up checking if the URL contains a org.dcache.uuid
parameter. In this case hostList->size() == 3
, and additionally the isDCachePool()
returns false
because the aforementioned LastURL
does not contain org.dcache.uuid
parameter, whereas the URL in the Got invalid redirection URL
error message does contain the parameter.
If the condition above would evaluate to true
, the se01.grid.nchc.org.tw
would likely end up in the excludeString
.
By the way, T2_TW_NCHC site fixed a DNS issue ( https://ggus.eu/index.php?mode=ticket_info&ticket_id=168267 ) , the behavior of the file open will be different. Just FYI.
Code archaeology on why we construct the exclude string only for a specific dCache case pointed only to https://github.com/cms-sw/cmssw/pull/6592, which doesn't help much.
I don't think the 'exclude' string would hurt anything. In a given time, if a host is not behaving, it can be added to the exclude to jump to a next available host until the list is exhausted. But I am not an xrootd expert. I am just an experienced user.
At line 245 in Utilities/XrdAdaptor/src/XrdRequestManager.cc, adding this
m_disabledExcludeStrings.insert(dataServer); // proposed by bockjoo
like soif (!dataServer.empty()) { ex.addAdditionalInfo("Problematic data server: " + dataServer); m_disabledExcludeStrings.insert(dataServer); // proposed by bockjoo }
can improve in reducing the Xrd federation file open errors.