Closed by ageorget 12 months ago
I am unfamiliar with the ApplicationManager, but the logging I see here is not very helpful for diagnosing what the problem is in dCache. For instance, "Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists" does not look like a dCache error message.
If there is an authentication error here, perhaps an inspection of the dCache door log and dCache gPlazma log might reveal something else.
Offhand, a certificate with SANs which are hostnames (not IPs) should work with dCache xroot (and the xroot client).
Sorry, I missed your comment
Nothing is logged in door logs so I think the error is raised directly by the client.
"[FATAL] Auth failed" is indeed an xrootd copy client message, but something must be happening on the server end for it to fail.
Is it possible to get more debug-level logging out of the client, similar to xrdcp -d 3?
What is RootDatabase? Are permissions on it keyed to a domain name?
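For the debug-logging question above, here is a minimal sketch of how a more verbose client run could be prepared. The `xrdcp -d 3 -f` flags are the ones already used in this thread. The environment variables `XRD_LOGLEVEL` and `XrdSecDEBUG` are assumptions on my part (to my knowledge they are read by the XRootD client and its security layer, but please verify against the client version in use). The helper name `debug_client_invocation` is hypothetical.

```python
import os
import shlex
from typing import Dict, List, Tuple


def debug_client_invocation(url: str, dest: str = "/dev/null") -> Tuple[List[str], Dict[str, str]]:
    """Build an xrdcp command line and environment for a verbose test copy.

    Nothing is executed here; the caller decides whether to run the command.
    """
    env = dict(os.environ)
    # Assumption: these variables raise client and security-layer verbosity.
    env["XRD_LOGLEVEL"] = "Debug"
    env["XrdSecDEBUG"] = "1"
    # -d 3 and -f are the xrdcp options quoted in this thread.
    cmd = ["xrdcp", "-d", "3", "-f", url, dest]
    return cmd, env


if __name__ == "__main__":
    cmd, env = debug_client_invocation(
        "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/"
        "atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1"
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, env=env, check=False)
```

If the failing tests open files through an experiment framework rather than xrdcp, only the environment part of this sketch would apply, since the xrdcp flags are not available there.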
Hi Albert,
What is RootDatabase? Are permissions on it keyed to a domain name?
This is on Atlas HammerCloud side, not related to dCache.
With this certificate installed on ccdcacli432, the CMS SAM tests and ATLAS HC tests are OK:
Issuer: C=NL, O=GEANT Vereniging, CN=GEANT eScience SSL CA 4
Validity
Not Before: Dec 6 00:00:00 2023 GMT
Not After : Jan 4 23:59:59 2025 GMT
Subject: DC=org, DC=terena, DC=tcs, C=FR, ST=Paris, O=Centre national de la recherche scientifique, CN=ccdcacli432.in2p3.fr
...
X509v3 Subject Alternative Name:
DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr
If I change it to this certificate, the tests immediately fail:
Issuer: C=NL, O=GEANT Vereniging, CN=GEANT eScience SSL CA 4
Validity
Not Before: Dec 5 00:00:00 2023 GMT
Not After : Jan 3 23:59:59 2025 GMT
Subject: DC=org, DC=terena, DC=tcs, C=FR, ST=Paris, O=Centre national de la recherche scientifique, CN=ccdcacli422.in2p3.fr
...
X509v3 Subject Alternative Name:
DNS:ccdcacli422.in2p3.fr, DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr
I'm not able to reproduce this with a simple xrdcp command. It works in both cases. So I'm wondering if the VOs are using a special XRootD configuration that disallows SAN support.
I found this in the XRootD documentation: https://xrootd.slac.stanford.edu/doc/gsidocs/XRootDGSIProtocolSpecifications.html
Server identity verification
A crucial part to avoid man-in-the-middle attacks is the client verification of server identity. The basic idea is that the client knows the name of the server it is contacting and expects to find this name in the DN of the server certificate. Complications arise when hostname aliases are used, and/or when the same server certificate is used by more servers, making use of the Subject Alternative Name (SAN) support.
Support for SAN matching is introduced in v4.9, together with alternative ways to resolve the hostname on the client, without necessarily relying on the DNS.
Despite the version, the client has the possibility to defined exceptions via the environment variable XrdSecGSISRVNAMES, a comma-separated list of allowed/disallowed names, supporting wild-cards.
But we use v5.6.3 at IN2P3-CC, so >4.9
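To make the quoted rule concrete, here is a simplified sketch that compares a CN-only check with a SAN-aware check on the two certificates shown above. This is a toy model, not XRootD's actual verification code. It assumes the client expects the door's own hostname (`ccdcacli432.in2p3.fr`) rather than the alias, and it ignores wildcards and the `XrdSecGSISRVNAMES` exceptions. All names in the code are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ServerCert:
    """Only the certificate fields relevant to hostname checks."""
    common_name: str
    san_dns: Tuple[str, ...] = ()


def cn_only_match(cert: ServerCert, expected_host: str) -> bool:
    """Simplified pre-SAN check: only the subject CN is compared."""
    return cert.common_name.lower() == expected_host.lower()


def san_aware_match(cert: ServerCert, expected_host: str) -> bool:
    """Simplified SAN-aware check: the CN or any DNS SAN entry may match."""
    names = {cert.common_name.lower()} | {n.lower() for n in cert.san_dns}
    return expected_host.lower() in names


# The two certificates quoted in this thread.
CERT_432 = ServerCert(
    "ccdcacli432.in2p3.fr",
    ("ccdcacli432.in2p3.fr", "ccxrootdatlas.in2p3.fr"),
)
CERT_422 = ServerCert(
    "ccdcacli422.in2p3.fr",
    ("ccdcacli422.in2p3.fr", "ccdcacli432.in2p3.fr", "ccxrootdatlas.in2p3.fr"),
)

# Assumption: the name the client expects is the door host, not the alias.
EXPECTED_HOST = "ccdcacli432.in2p3.fr"

if __name__ == "__main__":
    for label, cert in (("CN=ccdcacli432", CERT_432), ("CN=ccdcacli422", CERT_422)):
        print(label,
              "cn_only:", cn_only_match(cert, EXPECTED_HOST),
              "san_aware:", san_aware_match(cert, EXPECTED_HOST))
```

Under these assumptions the CN-only check accepts the first certificate and rejects the second, while the SAN-aware check accepts both, which matches the behaviour seen with the tests versus a recent xrdcp.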
The CMS SAM tests raised the same error:
START TIME: Mon Dec 4 15:08:20 UTC 2023 --> running cmsRun -j fjr.xml -p analysis_test.py
04-Dec-2023 15:08:21 UTC Initiating request to open file root://ccxrootdcms.in2p3.fr:1094//pnfs/in2p3.fr/data/cms/disk/data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root
[2023-12-04 15:08:21.715468 +0000][Error ][AsyncSock ] [ccxrootdcms.in2p3.fr:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2023-12-04 15:08:21.880432 +0000][Error ][AsyncSock ] [ccxrootdcms.in2p3.fr:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2023-12-04 15:08:21.880546 +0000][Error ][PostMaster ] [ccxrootdcms.in2p3.fr:1094 #0] Unable to recover: [FATAL] Auth failed.
I ran the xrootd door in debug mode this morning with a problematic certificate. I'm trying to find some hint in it.
When you say "I'm not able to reproduce this with a simple xrdcp command", you still mean against the dCache endpoints, correct?
If so, then this is indeed a client issue.
Yes, a basic xrdcp from a worker node works with both certificates:
xrdcp -v -f root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 /dev/null
[354.1MB/354.1MB][100%][==================================================][177.1MB/s]
I got an answer from the CMS team:
The CMSSW/xrootd version we use for SAM and in HC testing is very old, the version current at the start of Run 2 (those are the oldest analysis activities we still support on the grid). It is xrootd v4.6.0.
We know the version has a lot of issues. The plan is to move this forward to v4.8.5 next year but not v4.9 yet. There are also discussions about porting old, still-in-use CMSSW versions to newer, v5 xrootd version.
Support for SAN matching has only been available since v4.9, so this is indeed a client issue: they are still using an xrootd client version from 2017!
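As a small illustration of that reasoning, the following sketch checks the client versions mentioned in this thread against the v4.9 threshold quoted from the XRootD documentation. The function names are hypothetical, and the version parsing is deliberately naive (plain dotted numbers with an optional leading "v").

```python
import re
from typing import Tuple

# Threshold taken from the documentation quoted in this thread.
SAN_MIN_VERSION = (4, 9)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'v4.6.0' or '5.6.3' into a tuple of integers."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_san_matching(version: str) -> bool:
    """True if major.minor is at least 4.9."""
    return parse_version(version)[:2] >= SAN_MIN_VERSION


if __name__ == "__main__":
    for v in ("v4.6.0", "v4.8.5", "v5.6.3"):
        print(v, supports_san_matching(v))
```

Both v4.6.0 (current CMS SAM/HC client) and v4.8.5 (the planned upgrade) fall below the threshold, so the planned upgrade alone would not be expected to change the outcome.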
Closing this ticket. Cheers
Hi,
We run several dCache XRootD doors behind dedicated per-VO aliases, and we have been facing an issue since we started using the certificate of one door with the other hostnames as alternative names.
i.e. the alias ccxrootdatlas.in2p3.fr targets ccdcacli422.in2p3.fr and ccdcacli432.in2p3.fr
With one single certificate with :
This kind of certificate works for everything (webdav, pools, ...) except the xrootd doors, where the Atlas HammerCloud tests and CMS SAM tests began to fail with the error:
[FATAL] Auth failed
These are analysis xrootd doors, so only the gsi plugin is used:
I tried to reproduce the problem with a manual xrdcp from a WN, but it worked:
Nothing is logged in XrootD door logs so I think the error is raised directly by the client. Full log from clients : pandaid=6041716912__payload.txt