cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

Xrd Retrial Improvement #46086

Open bockjoo opened 2 months ago

bockjoo commented 2 months ago

At line 245 in Utilities/XrdAdaptor/src/XrdRequestManager.cc, adding this m_disabledExcludeStrings.insert(dataServer); // proposed by bockjoo like so if (!dataServer.empty()) { ex.addAdditionalInfo("Problematic data server: " + dataServer); m_disabledExcludeStrings.insert(dataServer); // proposed by bockjoo } can improve in reducing the Xrd federation file open errors.

cmsbuild commented 2 months ago

cms-bot internal usage

cmsbuild commented 2 months ago

A new Issue was created by @bockjoo.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

bockjoo commented 2 months ago

This is a reproducer CMSSW config:

import FWCore.ParameterSet.Config as cms

process = cms.Process('NoSplit')

process.source = cms.Source("PoolSource",
                            fileNames = cms.untracked.vstring(
                                # ERROR Generation
                                #'/store/mc/RunIIFall17GS/MinBias_TuneCP5_inelasticON_13TeV-pythia8/GEN-SIM/93X_mc2017_realistic_v3-v1/30000/5C4FD8FA-71C1-E711-8D54-001E67247CC9.root',
                                # SHOULD work
                                #'/store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/2550003/86acbeb4-c085-4ebd-87ee-25d1ec0643bf.root',
                                # SHOULD work
                                #'root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/2550003/86acbeb4-c085-4ebd-87ee-25d1ec0643bf.root',
                                # NCHC nchc
                                'root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root',
                            )
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(10))
process.options = cms.untracked.PSet(wantSummary = cms.untracked.bool(True))
process.output = cms.OutputModule("PoolOutputModule",
    outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
    fileName = cms.untracked.string('output.root'),
)
process.out = cms.EndPath(process.output)

Testing it with CMSSW_13_3_1_patch1

With the above mentioned addition of m_disabledExcludeStrings.insert(dataServer);

, the new_filename will look like: new_filename root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutri no_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root ?tried=,se01.grid.nchc.org.tw

For-loop for (int idx = 0; idx < retries; idx++) { } as is does not work for this file.

The file is at T2_TW_NCHC and T2_CH_CERN:

xrdfs xrootd-cms.infn.it:1094 locate -m /store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root
se01.grid.NCHC.org.tw:11001 Server Read
eoscms-ns-ip563.cern.ch:1098 Server ReadWrite 
eoscms-ns-ip700.cern.ch:1098 Server ReadWrite 

and at T1_US_FNAL_Disk:

xrdfs cmsxrootd.fnal.gov:1094 locate -m /store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root
cmsxrootd-site3.fnal.gov:1094 Manager Read
cmsxrootd-site2.fnal.gov:1094 Manager Read
cmssrv608.fnal.gov:1094 Manager Read
cmsxrootd-site1.fnal.gov:1094 Manager Read
But the for-loop never reaches beyond T2_TW_NCHC site.

Adding m_disabledExcludeStrings.insert(dataServer);

seems to be the simplest solution in improving the Xrd File Open failure in this case.

makortel commented 2 months ago

assign core

cmsbuild commented 2 months ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 months ago

Adding @bbockelm

There seems to be some relation to https://github.com/cms-sw/cmssw/issues/42056

makortel commented 2 months ago

I ran the "reproducer configuration" in 14_2_0_pre1. Here is what happens (this may be mostly for myself)

The retry loop stops because the same DataServer was returned twice https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L251-L255 (on the first loop iteration the aforementioned DataServer was added to m_disabledSourceStrings).


Some questions

makortel commented 2 months ago

I can confirm that adding m_disabledExcludeStrings.insert(dataServer); to https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L244-L246 causes the job to succeed. In this case the retry open attempt is for file root://xrootd-cms.infn.it//store/mc/Run3Summer21PrePremix/Neutrino_E-10_gun/PREMIX/Summer22_124X_mcRun3_2022_realistic_v11-v2/40004/1a5780bf-2132-4060-a573-66b4a07dc7ac.root?tried=,se01.grid.nchc.org.tw.

However, we don't seem to have a good (or any?) way to test these things, and I'm concerned if this change could have unwanted effects on some other xrootd systems (e.g. CERN EOS, either with or without erasure coding).

makortel commented 2 months ago

The existing code to set the excludeString (that would end up in the m_disabledExcludeStrings) is called from https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L227 and looks like https://github.com/cms-sw/cmssw/blob/0477f06c9e2175d7299022b613549b48034d02e7/Utilities/XrdAdaptor/src/XrdSource.cc#L290-L304

Underneath the isDCachePool() ends up checking if the URL contains a org.dcache.uuid parameter. In this case hostList->size() == 3, and additionally the isDCachePool() returns false because the aforementioned LastURL does not contain org.dcache.uuid parameter, whereas the URL in the Got invalid redirection URL error message does contain the parameter.

If the condition above would evaluate to true, the se01.grid.nchc.org.tw would likely end up in the excludeString.

bockjoo commented 2 months ago

By the way, T2_TW_NCHC site fixed a DNS issue ( https://ggus.eu/index.php?mode=ticket_info&ticket_id=168267 ) , the behavior of the file open will be different. Just FYI.

makortel commented 2 months ago

Code archaeology on why we construct the exclude string only for a specific dCache case pointed only to https://github.com/cms-sw/cmssw/pull/6592, which doesn't help much.

bockjoo commented 2 months ago

I don't think the 'exclude' string would hurt anything. In a given time, if a host is not behaving, it can be added to the exclude to jump to a next available host until the list is exhausted. But I am not an xrootd expert. I am just an experienced user.