A new Issue was created by @haozturk Hasan Öztürk.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks
@aandvalenzuela @iarspider Do you know what to check, or do we need to wait for @smuzaffar to come back?
@makortel we have to wait for @smuzaffar.
@haozturk, I think the problem is with the OSG software stack; e.g., running the same command under the opensciencegrid/osg-wn:3.5-el8 container also hangs and then fails:
> singularity shell -B /home -B /tmp --contain --ipc --pid docker://opensciencegrid/osg-wn:3.5-el8
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed
I have rebuilt cmssw/cms:rhel8 to get the latest versions of packages, and I noticed that gfal-ls works if /cvmfs is not mounted:
> singularity shell -B /home -B /tmp --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO: Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
but if I mount /cvmfs then it fails:
> singularity shell -B /home -B /tmp -B /cvmfs --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO: Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed
Note that the cmssw/cms:rhel8 containers are based on opensciencegrid/osg-wn:3.5-el8.
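For what it's worth, one way to confirm that the bind mount itself changes the runtime environment (a sketch, reusing the image tag above, not the exact commands run here) is to dump and diff the container environment with and without /cvmfs bound:

# run on the host; dump the container environment without and then with /cvmfs bound
singularity exec -B /home -B /tmp --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009 env | sort > env-no-cvmfs.txt
singularity exec -B /home -B /tmp -B /cvmfs --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009 env | sort > env-with-cvmfs.txt
# PATH/LD_LIBRARY_PATH differences point at whatever gets sourced on container startup
diff env-no-cvmfs.txt env-with-cvmfs.txt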
Is this a side effect of OSG dropping GSI support (which SRM/GSI/GridFTP use), i.e. do we need to decouple from the OSG worker node clients and use the EGI, WLCG, or our own tools for this? (I thought OSG 3.5 still had GSI support, but maybe this was removed for CentOS 8? Shall we ask OSG support to comment?)
No @stlammel, I think GSI support was dropped only in OSG 3.6. @jblomer pointed out that there might be something which changes PATH/LD_LIBRARY_PATH when /cvmfs is mounted, and I think the issue is with the https://github.com/cms-sw/cms-docker/blob/master/cms/osg-wn-client-setup.sh script, which sources /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh and changes the LD_LIBRARY_PATH. Note that this script is sourced when singularity is started. It looks like the software installed in the container and the packages made available via that setup.sh are not compatible.
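A quick way to see the resulting mix from inside the container (a sketch; the paths are the ones quoted above) is to check what ends up first on PATH/LD_LIBRARY_PATH and which gfal-ls gets picked up:

# inside the container, with /cvmfs bound
echo "$LD_LIBRARY_PATH" | tr ':' '\n'   # the /cvmfs osg-wn-client dirs are prepended
command -v gfal-ls                      # which copy of gfal-ls is first on PATH
head -1 "$(command -v gfal-ls)"         # gfal-ls is typically a python script; this shows which interpreter it requests
# repeat the same checks without /cvmfs bound and compare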
Hallo Shahzad, Thanks! We had this "mixed" and broken environment before, and this was one of the reasons for using the OSG WN client environment. Looking at the setup script, it puts the OSG location ahead in the PATH/LD_LIBRARY_PATH, as it should. However, the OSG 3.5 WN environment for CentOS 8 switched to python3, while /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.62-1/el8-x86_64/usr/bin contains only python2. The python3 is thus picked up from the OS in the image, which seems not to work and mixes the gfal2 1.8/1.7 environments. I would ping OSG support. Thanks,
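To verify the python mismatch described above (a sketch; the release directory is the one quoted in the previous comment, and gfal2 is the standard module name of the gfal2 python bindings):

# inside the container, with /cvmfs bound and the OSG 3.5 WN setup sourced
ls /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.62-1/el8-x86_64/usr/bin/ | grep python
# expect only python2* above; python3 then resolves to the OS copy inside the image
command -v python3
# if the OS python3 cannot load the gfal2 bindings, the mixed environment is broken
python3 -c 'import gfal2' || echo "gfal2 bindings not usable from this python3"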
Thanks for following this issue. @stlammel, did you contact OSG support? If not, who should do it?
Hallo Hasan @haozturk, no, I did not open an OSG ticket. I can if we agree this is the direction we want to go.
Thanks @stlammel. I rely on your and @smuzaffar's judgement on this, as I'm not an expert on the issue. I just want to highlight that more and more el8 workflows are coming into production, and we're banning more than 30 sites for such workflows. This might increase the delivery time of the requests and lower the utilization of the banned sites. So, the sooner we fix it, the less trouble we'll have. I'm happy to help in any way I can.
Are there any downsides to doing this? If not, can we make it happen ASAP?
Thanks,
Jen
@stlammel any followup? As Hasan says above, we are going to trust your expertise on this. We currently do not see a downside to switching, so if there is one, we need to know. Otherwise, let's get moving on it.
Jen
Yes, we decided to involve OSG. Hasan is in the loop and should get copies/updates. I can't give you a timeline, though; I would expect at least several days. There is also a discussion about the origin of the OSG WN client use. Given that the issue was first encountered about a month ago, I would give it a few days for things to be better understood.
Dear experts,
We see that many production workflows have failed at sites which use SRM during stageout. With a small local test, we can see that SRM calls from within this container fail, while the same calls succeed on lxplus:
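For reference, the test is roughly the following (a sketch, assuming the cmssw/cms:rhel8 image discussed above; the DESY SRM endpoint is the same one used in the examples above, and any SRM-served path shows the same behaviour):

> singularity shell -B /home -B /tmp -B /cvmfs --contain --ipc --pid docker://cmssw/cms:rhel8
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/

Inside the container the gfal-ls call fails with a "Communication error on send", while the same call run directly on lxplus succeeds.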
This is an example production workflow which failed at T1_FR_CCIN2P3 during stageout.
Can you please look into the issue with this container?