cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

XrdRequestManager sometimes blocks all available sources to remote files, causing read errors from EOS #43162

Open amadio opened 10 months ago

amadio commented 10 months ago

This is an issue that has been regularly observed in EOS CMS, but also happened recently on EOS public with users trying to access CMS open data (see INC3616797).

CMSSW's XrdRequestManager has a mechanism to mark slow sources as "bad" when reading remote files. This shows up in the EOS logs below, where at the end of each line info=eos.app=cmst0&tried=... indicates that the "tried" sources should not be considered.

231016 16:40:40 time=1697467240.260615 func=open                     level=INFO  logid=fa3da3dc-6c31-11ee-b552-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007fa580bfd700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
231016 16:46:36 time=1697467596.650015 func=open                     level=INFO  logid=ceaa5980-6c32-11ee-91c5-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007fa7beffd700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
231016 16:48:52 time=1697467732.777665 func=open                     level=INFO  logid=1fcdc73e-6c33-11ee-af98-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007f907a5fe700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
231016 16:52:27 time=1697467947.350690 func=open                     level=INFO  logid=9fb3063a-6c33-11ee-9fd5-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007f92c83fe700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
231016 16:56:11 time=1697468171.243226 func=open                     level=INFO  logid=252650ba-6c34-11ee-bd8c-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007f8984ff9700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
231016 16:59:20 time=1697468360.811433 func=open                     level=INFO  logid=96242044-6c34-11ee-95dc-a4bf0179624c unit=mgm@eoscms-ns-ip563.cern.ch:1094 tid=00007fa59b3ff700 source=XrdMgmOfsFile:551              tident=cmsplt01.683:2522@b7g20p0453 sec=gsi   uid=103031 gid=1399 name=cmst0 geo="" op=read path=/eos/cms/tier0/store/hidata/HIRun2023A/HIPhysicsRawPrime12/RAW/v1/000/375/055/00000/debd4651-2d60-47b4-a6e7-b90e87113db8.root info=eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel
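
The tried=... entry in the opaque info above is an &-separated key=value string whose value is a comma-separated list of hosts the client asked the manager not to use (triedrc=resel indicating a re-selection was requested). A minimal sketch of extracting that list from such a string, purely for illustration (this is not code from EOS or CMSSW):

// Illustrative only: pull the "tried" host list out of opaque info such as
// "eos.app=cmst0&tried=st-096-gg5618jy.cern.ch,p06636710g40375.cern.ch&triedrc=resel"
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> parseTriedHosts(const std::string &opaque) {
  std::vector<std::string> hosts;
  std::istringstream params(opaque);
  std::string kv;
  while (std::getline(params, kv, '&')) {   // opaque info is &-separated key=value pairs
    if (kv.rfind("tried=", 0) != 0)
      continue;
    std::istringstream list(kv.substr(6));  // drop the "tried=" prefix
    std::string host;
    while (std::getline(list, host, ','))   // hosts are comma-separated
      if (!host.empty())
        hosts.push_back(host);
  }
  return hosts;
}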

However, as can be seen above, two different FSTs are excluded in the request, and EOS CMS has two replicas, so both available sources are blocked, which leads to read errors from EOS. This feature also causes a lot of open errors and extra load on the instance, as CMSSW keeps issuing open requests against all sources until one fails. This can be seen in the plot below (note that this happens mostly on the EOS CMS instance):

[plot: cmssw-open-errors]

For erasure-coded areas, this feature has caused even more problems, because many stripes of the file have to be opened, so a workaround has been put in place on EOS to ignore the blocking of sources for erasure-coded files in https://github.com/cern-eos/eos/commit/31f14fdea51914a23c2bbb6677857ba440aaacce.

Therefore, as has been discussed before (see meeting minutes in https://indico.cern.ch/event/1271977/), it would be good to have a way to disable this mechanism, or even to disable it by default and only enable it when requested by the user. While this feature is quite useful for choosing between different distributed data sources, for EOS specifically, since all sources are likely in the same data center at CERN, the costs (read errors) outweigh the potential benefits.

Potentially related issues: #39050, #42056.

cmsbuild commented 10 months ago

A new Issue was created by @amadio Guilherme Amadio.

@Dr15Jones, @antoniovilela, @makortel, @sextonkennedy, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

smuzaffar commented 10 months ago

assign core

cmsbuild commented 10 months ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 10 months ago

@bbockelm comments?

makortel commented 10 months ago

(see meeting minutes in https://indico.cern.ch/event/1271977/)

I get "Access Denied".

mariayh1 commented 10 months ago

Indeed, CMS has a specific group for these meetings, so maybe not everyone is in it. Let me paste the minutes here:

CMSSW optimization removing storage servers in the EOS system is increasing the request load. This has been reported several times, but it is having more impact now with the load of CMS analysis. This feature should be removed: https://github.com/cms-sw/cmssw/blob/6ba4223dda3848469ea1597fd8ec8d3cdacde6a6/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L207-L266. The issue is that the CMSSW software bans all the servers that can provide the requested file. This results in errors since there is nowhere else to read it from. A patch was done to work around the CMSSW optimization ONLY for EC: https://gitlab.cern.ch/dss/eos/-/commit/31f14fdea51914a23c2bbb6677857ba440aaacce.

CMS will discuss internally to solve this issue.
EOS will provide some statistics plus some recent examples of CMS users impacted by this issue:
    A recent issue reported from CMST0 confirmed the situation: https://mattermost.web.cern.ch/it-dep/pl/xce7ermz8prgibcep6wzccbsbe
    A recent open data issue affecting CMS users: https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC3616797

makortel commented 10 months ago

I went through some earlier email threads on this topic of CMSSW's "multi-source xrootd" vs. EOS Erasure Coding. I'll tag @stlammel and @drkovalskyi as they were part of the thread around end-of-March and beginning-of-April in 2022 (there were very differing opinions of what exactly should be fixed and where). I'll also include @dan131riley.

In that thread @bbockelm commented (I take the liberty of copying it here)

Can someone provide the CMSSW-side of the failure? I only saw the EOS-side below and didn't spot any CMSSW stdout from the reported T0 problem on the JIRA tickets.

I feel the CMSSW-side timeouts are generous -- they're something around 30 or 60 seconds. Does it really take a minute to assemble up to 1024 read requests?!?

It's important to realize there are two things here:

  1. CMSSW will natively "look around" for alternate sources of a file every few minutes. If this fails, there's no problem -- not even sure it's recorded in the client stdout.
  2. If there's a hard failure -- say the Xrootd request completely failed due to a timeout -- CMSSW will try to reopen the file one last time and see if it's able to save the job.

If we're saying that it's normal and expected that a Read or a ReadV request takes more than a minute to respond, one can increase the timeout. However, it seems like we should verify that indeed is the problem first.

The question of "is it normal for read requests to take more than a minute" remained unanswered in that thread. Does anyone here have an idea how normal that is?


For testing purposes, the following in the job configuration file should disable the xrootd adaptor for the root:// protocol (I didn't test):

process.AdaptorConfig = cms.Service("AdaptorConfig",
    native = cms.untracked.vstring("root")  # only for the root: protocol
    # or
    # native = cms.untracked.vstring("all")  # for all protocols
)

(in the email thread some tests were discussed, but they were apparently not really followed up on)

I think the timeouts can be increased in the job configuration via

process.SiteLocalConfigService = cms.Service("SiteLocalConfigService",
    overrideSourceTimeout = cms.untracked.uint32(<value>) # XrdAdaptor uses 3*60 seconds unless a non-zero value is specified here
)

or in the site-local-config.xml via

<site-local-config>
  <site name="<site name>">
    ...
    <source-config>
      ...
      <timeout-in-seconds value="<value>"/>
    </source-config>
  </site>
</site-local-config>

Maybe we'd need some specific tests to see the impact of these settings?

stlammel commented 10 months ago

How long does EOS wait before it considers a connection dead and cleans it up? We had an issue at another site that was cleaning up thought-to-be-unused xrootd connections too quickly for some CPU-intensive jobs. If I recall correctly, a 15-minute wait was insufficient.

drkovalskyi commented 10 months ago

I have a feeling EOS tends to disconnect very quickly. When I run jobs interactively at CERN using xrootd as input for data on EOS, I often see disconnect errors like this:

[Error ][PostMaster ][11921] [st-096-gg5003qo.cern.ch:1095] Forcing error on disconnect: [ERROR] Operation interrupted.

I get more of them if I pause the job, but I haven't measured how long it was paused and how many errors I got after that. All these errors seem harmless and I don't think they cause failures. At least it doesn't happen often enough for me to see a clear correlation.

bbockelm commented 4 hours ago

Met with @amadio in person to discuss this. I think what's happening here is a misinterpretation of the logs.

This appears to be the slow-source scanning algorithm. If EOS returns a failure for slow-source scanning, the failure is ignored by the client and CMSSW continues happily.

However, the problem we discussed is that, from the EOS point of view, it's impossible to distinguish between "slow source scanning" and "exclusion due to failure". EOS operations can ignore the former but are interested in the latter; however, since the two look identical in the logs, the EOS monitoring is confounded.

Two ideas we brainstormed:

  1. Add an extra parameter to the URL indicating this is an "advisory open" which is permitted to fail without a problem. This would show up in the EOS log and can be filtered out in dashboards.
  2. If all sources are excluded, then instead of failing, return a randomly-selected source. If CMSSW is in "exclusion due to failure" mode, it'll throw an exception and exit. If it's in "slow source scanning" mode, it'll ignore the failure and continue.

I think (1) helps the intent come through and gives operations a better understanding. However, it requires changes on both the EOS and CMSSW sides (and rollout of a new CMSSW version). The advantage of (2) is that it can be quickly pushed out by EOS and instantly be taken advantage of by CMSSW.
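
For illustration, a minimal sketch of the selection logic in idea (2), assuming the replica list and the client's "tried" list are available as plain host lists (this is not actual EOS code; names and structure are assumptions):

// Idea (2), sketched: if every replica is on the client's exclusion ("tried")
// list, return a randomly selected replica instead of failing the open.
#include <algorithm>
#include <optional>
#include <random>
#include <string>
#include <vector>

std::optional<std::string> selectSource(const std::vector<std::string> &replicas,
                                        const std::vector<std::string> &tried) {
  std::vector<std::string> usable;
  for (const auto &r : replicas)
    if (std::find(tried.begin(), tried.end(), r) == tried.end())
      usable.push_back(r);

  if (!usable.empty())
    return usable.front();  // normal case: honor the exclusion list
  if (replicas.empty())
    return std::nullopt;    // nothing to serve at all

  // All replicas are excluded: hand back a random one rather than failing the
  // open, so a "slow source scan" open from CMSSW can still succeed.
  static std::mt19937 gen{std::random_device{}()};
  std::uniform_int_distribution<std::size_t> dist(0, replicas.size() - 1);
  return replicas[dist(gen)];
}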

cmsbuild commented 4 hours ago

cms-bot internal usage

makortel commented 2 hours ago

Thanks @bbockelm. Would it be feasible to do both? I.e. (2) as an immediately-deployable workaround, and (1) as a long term solution?

bbockelm commented 2 hours ago

I think both can be done, as they are complementary.

For (1), I think all that's needed is to have CMS and EOS agree on a parameter name. If we need a strawman, I'd propose advisory-open=true

For (2), it's all on EOS. I'll defer to @amadio.

makortel commented 8 minutes ago

  1. Add an extra parameter to the URL indicating this is an "advisory open" which is permitted to fail without a problem. This would show up in the EOS log and can be filtered out in dashboards.

... I think (1) helps the intent come through and gives operations a better understanding.

For (1), I think all that's needed is to have CMS and EOS agree on a parameter name. If we need a strawman, I'd propose advisory-open=true

@bbockelm Do I read the XrdAdaptor code correctly that this new parameter in the URL would need to get added in XrdAdaptor::RequestManager::OpenHandler::open() https://github.com/cms-sw/cmssw/blob/c312b85e0474fb1c058f8e2c8e0aadcf6b69c30d/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L1178 that then gets passed to https://github.com/cms-sw/cmssw/blob/c312b85e0474fb1c058f8e2c8e0aadcf6b69c30d/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L1190

but only if the XrdAdaptor::RequestManager::OpenHandler::open() is called from XrdAdaptor::RequestManager::checkSourcesImpl() https://github.com/cms-sw/cmssw/blob/c312b85e0474fb1c058f8e2c8e0aadcf6b69c30d/Utilities/XrdAdaptor/src/XrdRequestManager.cc#L523-L526

?
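
For what it's worth, a minimal sketch of what appending the strawman advisory-open=true parameter could look like at that point, assuming the URL is available as a std::string before the XrdCl::File::Open() call and that a flag tells us the open originates from checkSourcesImpl() (helper name and guard are hypothetical, not the actual CMSSW change):

// Hypothetical helper: append the strawman "advisory-open=true" CGI flag to an
// xrootd URL for opens that are only probing for alternate sources.
#include <string>

std::string addAdvisoryOpenFlag(const std::string &url) {
  // Opaque info starts after '?'; further key=value pairs are '&'-separated.
  const char sep = (url.find('?') == std::string::npos) ? '?' : '&';
  return url + sep + "advisory-open=true";
}

// Sketch of use inside XrdAdaptor::RequestManager::OpenHandler::open(), with
// "isSourceScan" a hypothetical flag set only for opens triggered by
// checkSourcesImpl() (variable names below are illustrative):
//   if (isSourceScan) url = addAdvisoryOpenFlag(url);
//   file->Open(url, flags, mode, handler, timeout);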