Stageout failures at sites using SRM with rhel8

haozturk commented 1 year ago

Dear experts,

We see that many production workflows have failed during stageout at sites which use SRM. A small local test shows that SRM calls fail inside the cmssw/cms:rhel8 container, while they succeed directly on lxplus:

[haozturk@lxplus708 ~]$ singularity shell --bind /cvmfs --bind /afs --contain --ipc --pid /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8
Singularity> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on lxplus708.cern.ch reports Error reading token data header: Connection closed

[haozturk@lxplus708 ~]$ gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
joroemer
tpook
...

This is an example production workflow which failed at T1_FR_CCIN2P3 during stageout

Can you please look into the issue with this container?

cmsbuild commented 1 year ago

A new Issue was created by @haozturk Hasan Öztürk.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

@aandvalenzuela @iarspider Do you know what to check, or do we need to wait for @smuzaffar to come back?

iarspider commented 1 year ago

@makortel we have to wait for @smuzaffar.

smuzaffar commented 1 year ago

@haozturk, I think the problem is with the OSG software stack: running the same command in the opensciencegrid/osg-wn:3.5-el8 container also hangs and then fails:

> singularity shell -B /home -B /tmp  --contain --ipc --pid docker://opensciencegrid/osg-wn:3.5-el8
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed

I have rebuilt cmssw/cms:rhel8 to pick up the latest package versions, and I noticed that gfal-ls works if /cvmfs is not mounted:

> singularity shell -B /home -B /tmp  --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO:    Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/

But if I mount /cvmfs, it fails:

> singularity shell -B /home -B /tmp  -B /cvmfs --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO:    Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed

smuzaffar commented 1 year ago

Note that the cmssw/cms:rhel8 containers are based on opensciencegrid/osg-wn:3.5-el8.

stlammel commented 1 year ago

Is this a side effect of OSG dropping GSI support (which SRM/gsi/gridftp use)? If so, do we need to decouple from the OSG worker node clients and use the EGI, WLCG, or our own tools for this? (I thought OSG 3.5 had GSI support, but maybe it was removed for CentOS 8? Shall we ask OSG support to comment?)

Stephan

smuzaffar commented 1 year ago

No @stlammel, I think GSI support was dropped only in OSG 3.6. @jblomer pointed out that something might change PATH/LD_LIBRARY_PATH when /cvmfs is mounted, and I think the issue is with the https://github.com/cms-sw/cms-docker/blob/master/cms/osg-wn-client-setup.sh script, which sources /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh and changes LD_LIBRARY_PATH. Note that this script is sourced when singularity starts. It looks like the software installed in the container and the packages made available via /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh are not compatible.
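
A minimal sketch of how this hypothesis could be checked: source a setup script in a subshell and list the LD_LIBRARY_PATH entries it adds. The helper name `added_ld_entries` is hypothetical and not part of cms-docker or the OSG tooling; it only assumes a POSIX shell.

```shell
# Hypothetical helper (not part of cms-docker or OSG): print the
# LD_LIBRARY_PATH entries that sourcing a setup script adds, one per line.
# Everything runs in a subshell, so the calling shell's environment is untouched.
added_ld_entries() {
    script=$1
    # Wrap the current value in colons so whole entries can be matched.
    before=":${LD_LIBRARY_PATH:-}:"
    (
        . "$script" >/dev/null 2>&1
        printf '%s\n' "${LD_LIBRARY_PATH:-}" | tr ':' '\n' |
        while IFS= read -r entry; do
            [ -n "$entry" ] || continue
            case "$before" in
                *":$entry:"*) ;;                 # already present before sourcing
                *) printf '%s\n' "$entry" ;;     # newly added by the script
            esac
        done
    )
}
```

Inside a container started with `-B /cvmfs`, one could call `added_ld_entries /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh` to see which library directories end up ahead of the ones shipped in the image.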

stlammel commented 1 year ago

Hello Shahzad, thanks! We had this "mixed" and broken environment before, and it was one of the reasons for using the OSG WN client environment. Looking at the setup script, it puts the OSG location first in PATH/LD_LIBRARY_PATH, as it should. The OSG 3.5 WN environment for CentOS 8 switched to python3, but /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.62-1/el8-x86_64/usr/bin contains only python2. python3 is picked up from the OS in the image, and this seems not to work (it then mixes the gfal2 1.8/1.7 environments). I would ping OSG support. Thanks,

Stephan
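
To see which tools come from the OSG location and which fall back to the OS in the image, something like the following sketch could help. The helper name `tool_origin` is hypothetical; it relies only on the standard `command -v` lookup and reports where the first match on PATH lives relative to a given prefix.

```shell
# Hypothetical helper: for each named command, report whether the copy found
# first on PATH lives under the given prefix or comes from somewhere else.
tool_origin() {
    prefix=$1
    shift
    for tool in "$@"; do
        path=$(command -v "$tool" 2>/dev/null) || path=""
        case "$path" in
            "")          echo "$tool: not found" ;;
            "$prefix"/*) echo "$tool: under prefix ($path)" ;;
            *)           echo "$tool: outside prefix ($path)" ;;
        esac
    done
}
```

For example, `tool_origin /cvmfs/oasis.opensciencegrid.org python3 gfal-ls` inside the container would show whether python3 and gfal-ls resolve to the same software stack or to a mix of the two.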

haozturk commented 1 year ago

Thanks for following this issue. @stlammel, did you contact OSG support? If not, who should do it?

stlammel commented 1 year ago

Hello Hasan @haozturk, no, I did not open an OSG ticket. I can if we agree this is the direction we want to go.

Stephan

haozturk commented 1 year ago

Thanks @stlammel, I rely on your and @smuzaffar's judgement on this, as I'm not an expert on the issue. I just want to highlight that more and more el8 workflows are coming to production and we're banning more than 30 sites for such workflows. This might increase the delivery time of the requests and lower the utilization of the banned sites. So, the sooner we fix it, the less trouble we'll have. I'm happy to help in any way I can.

jenimal commented 1 year ago

Are there any downsides to doing this? If not, can we make it happen ASAP?
Thanks, Jen

jenimal commented 1 year ago

@stlammel any follow-up? As Hasan says above, we are going to trust your expertise on this. We currently do not see a downside to switching, so if there is one we need to know. Otherwise, let's get moving on it.

Jen

stlammel commented 1 year ago

Yes, we decided to involve OSG. Hasan is in the loop and should get copies of the updates. I can't give you a timeline, though; I would expect at least several days. There is also a discussion about the origin of the OSG WN client use. Given that the issue was first encountered about a month ago, I would give it a few days for things to be better understood.

Stephan

cms-sw / cmssw

Stageout failures at sites using SRM with rhel8 #39591