dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

XRootD issue with hostname as SAN in certificate #7450

Closed ageorget closed 12 months ago

ageorget commented 12 months ago

Hi,

We run several dCache XRootD doors behind dedicated per-VO aliases, and we hit an issue when we started using the certificate of one door that lists the other hostnames as Subject Alternative Names.

For example, the alias ccxrootdatlas.in2p3.fr points to ccdcacli422.in2p3.fr and ccdcacli432.in2p3.fr.

Both use one single certificate with the following fields:

    Signature Algorithm: sha384WithRSAEncryption
        Issuer: C=NL, O=GEANT Vereniging, CN=GEANT eScience SSL CA 4
        Validity
            Not Before: Dec  5 00:00:00 2023 GMT
            Not After : Jan  3 23:59:59 2025 GMT
        Subject: DC=org, DC=terena, DC=tcs, C=FR, ST=Paris, O=Centre national de la recherche scientifique, CN=ccdcacli422.in2p3.fr
...
            X509v3 Subject Alternative Name:
                DNS:ccdcacli422.in2p3.fr, DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr
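To confirm which names a certificate actually carries, the SAN list can be pulled out of the textual dump shown above. The following is a minimal sketch, assuming the input is the text printed by `openssl x509 -text` and that the DNS entries sit on the line right after the SAN header, as in the dump above; the helper name `san_dns_names` is hypothetical.

```python
import re


def san_dns_names(openssl_text):
    """Return the DNS entries listed under the SAN header of an
    `openssl x509 -text` dump (hypothetical helper, illustration only)."""
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        # The DNS entries are assumed to be on the line after the header.
        if "X509v3 Subject Alternative Name" in line and i + 1 < len(lines):
            return re.findall(r"DNS:([^,\s]+)", lines[i + 1])
    return []


sample = """\
            X509v3 Subject Alternative Name:
                DNS:ccdcacli422.in2p3.fr, DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr
"""

names = san_dns_names(sample)
print(names)
print("ccxrootdatlas.in2p3.fr" in names)  # True
```

This only shows that the alias is present in the certificate; whether a given client consults the SAN list at all is a separate question.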

This kind of certificate works for everything (webdav, pools, ...) except the xrootd door, where ATLAS HammerCloud tests and CMS SAM tests began to fail with the error: [FATAL] Auth failed

These are analysis xrootd doors, so only the gsi plugin is used:


[xrootd-ccdcacli432Domain]
[xrootd-ccdcacli432Domain/xrootd]
xrootd.cell.name=xrootd-ccdcacli432
xrootd.plugins=gplazma:gsi,authz:none

I tried to reproduce the problem by running xrdcp manually from a worker node (WN), but it worked:

xrdcp -v -f root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 /dev/null
[354.1MB/354.1MB][100%][==================================================][177.1MB/s]

Nothing is logged in the XRootD door logs, so I think the error is raised directly by the client. Full client log: pandaid=6041716912__payload.txt

RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootCollection failed. return code: 1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
PoolCollectionC...WARNING Unable to create Collection: PFN:root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
PoolCollectionC...WARNING Could not connect to the file ( POOL : "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
EventSelector       ERROR Unable to open: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
AthenaSummarySvc     INFO  -> file incident: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 [GUID: ]
EventSelector       FATAL in sysInitialize(): exception with tag=EventSelector is caught
EventSelector       ERROR EventSelector        Unable to open: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1         StatusCode=FAILURE
ServiceManager      ERROR Unable to initialize service "EventSelector"
MetaDataSvc.Ser...  ERROR ServiceLocatorHelper::service: can not locate service EventSelector
MetaDataSvc       WARNING Cannot get EventSelector.
HistogramPersis...WARNING Histograms saving not required.
IoComponentMgr       INFO IoComponent[EventSelector] already registered @0x116c6d88
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootCollection failed. return code: 1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
RootFileHandler     ERROR Unable to open ROOT file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1" with options "READ"
FileMgr           WARNING open of file "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1", tech: "ROOT", flags: "READ" requested by RootDatabase failed. return code: 1
RootDatabase.open Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: FID=8CE4807D-2A9D-E549-B64E-67D4C302CE19 PFN=root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
PoolCollectionC...WARNING Unable to create Collection: PFN:root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
PoolCollectionC...WARNING Could not connect to the file ( POOL : "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
EventSelector       ERROR Unable to open: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
AthenaSummarySvc     INFO  -> file incident: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 [GUID: ]
EventSelector       FATAL in sysInitialize(): exception with tag=EventSelector is caught
EventSelector       ERROR EventSelector        Unable to open: root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1         StatusCode=FAILURE
ServiceManager      ERROR Unable to initialize service "EventSelector"
AthenaEventLoopMgr  FATAL No valid event selector called EventSelectorAthenaPool/EventSelector
ServiceManager      ERROR Unable to initialize Service: AthenaEventLoopMgr
Py:Athena            INFO leaving with code 33: "failure in initialization"
ApplicationMgr      ERROR Application Manager Terminated with error code 2
Warning in <TInterpreter::ReadRootmapFile>: class  Event found in libG4AtlasControlDict.so  is already in libtest_GPyTestDict.so
Warning in <TInterpreter::ReadRootmapFile>: class  UCharDbArray found in libStorageSvcDict.so  is already in libRootCnvDict.so
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
Error in <TNetXNGFile::Open>: [FATAL] Auth failed
alrossi commented 12 months ago

I am unfamiliar with the ApplicationManager, but the logging shown here does not help much in diagnosing a problem in dCache. For instance, "Error You cannot open the ROOT file [root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1] in mode READ if it does not exists" does not look like a dCache error message.

If there is an authentication error here, perhaps an inspection of the dCache door log and dCache gPlazma log might reveal something else.

Offhand, a certificate with SANs which are hostnames (not IPs) should work with dCache xroot (and the xroot client).

alrossi commented 12 months ago

Sorry, I missed your comment:

"Nothing is logged in door logs so I think the error is raised directly by the client."

"[FATAL] Auth failed" is indeed an xrootd copy client message, but something must be happening on the server end for it to fail.

Is it possible to get more debug-level logging out of the client, similar to xrdcp -d 3?
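One way to raise client verbosity is sketched here, assuming the client is built on the XRootD client library and honours its `XRD_LOGLEVEL` environment variable. That may also help for jobs that open files through ROOT rather than xrdcp, which is an assumption and not something verified in this thread. The helper `debug_client_command` is hypothetical and only builds the command and environment; it does not run anything.

```python
import os
import shlex


def debug_client_command(url, dest="/dev/null", log_level="Debug"):
    """Build an xrdcp command line and environment with verbose client
    logging (hypothetical helper; nothing is executed here)."""
    env = dict(os.environ)
    # XRD_LOGLEVEL is read by the XRootD client library.
    env["XRD_LOGLEVEL"] = log_level
    # "-d 3" and "-f" are the xrdcp options already used in this thread.
    argv = ["xrdcp", "-d", "3", "-f", url, dest]
    return argv, env


argv, env = debug_client_command(
    "root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/"
    "atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1"
)
print("XRD_LOGLEVEL=" + env["XRD_LOGLEVEL"], shlex.join(argv))
# To actually run it: subprocess.run(argv, env=env)
```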

alrossi commented 12 months ago

What is RootDatabase ?

Are permissions on it keyed to a domain name?

ageorget commented 12 months ago

Hi Albert,

"What is RootDatabase? Are permissions on it keyed to a domain name?"

This is on the ATLAS HammerCloud side, not related to dCache.

With this certificate installed on ccdcacli432, the CMS SAM tests and ATLAS HC tests are OK:

Issuer: C=NL, O=GEANT Vereniging, CN=GEANT eScience SSL CA 4
        Validity
            Not Before: Dec  6 00:00:00 2023 GMT
            Not After : Jan  4 23:59:59 2025 GMT
        Subject: DC=org, DC=terena, DC=tcs, C=FR, ST=Paris, O=Centre national de la recherche scientifique, CN=ccdcacli432.in2p3.fr
...        
        X509v3 Subject Alternative Name: 
                DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr

If I change it to this certificate, the tests immediately fail:


        Issuer: C=NL, O=GEANT Vereniging, CN=GEANT eScience SSL CA 4
        Validity
            Not Before: Dec  5 00:00:00 2023 GMT
            Not After : Jan  3 23:59:59 2025 GMT
        Subject: DC=org, DC=terena, DC=tcs, C=FR, ST=Paris, O=Centre national de la recherche scientifique, CN=ccdcacli422.in2p3.fr
...
            X509v3 Subject Alternative Name: 
                DNS:ccdcacli422.in2p3.fr, DNS:ccdcacli432.in2p3.fr, DNS:ccxrootdatlas.in2p3.fr

I'm not able to reproduce this with a simple xrdcp command: it works in both cases. So I'm wondering whether the VOs use a special XRootD configuration that disables SAN support.

I found this in the XRootD documentation (https://xrootd.slac.stanford.edu/doc/gsidocs/XRootDGSIProtocolSpecifications.html):

Server identity verification
A crucial part to avoid man-in-the-middle attacks is the client verification of server identity. The basic idea is that the client knows the name of the server it is contacting and expects to find this name in the DN of the server certificate. Complications arise when hostname aliases are used, and/or when the same server certificate is used by more servers, making use of the Subject Alternative Name (SAN) support.
Support for SAN matching is introduced in v4.9, together with alternative ways to resolve the hostname on the client, without necessarily relying on the DNS.
Despite the version, the client has the possibility to defined exceptions via the environment variable XrdSecGSISRVNAMES, a comma-separated list of allowed/disallowed names, supporting wild-cards.

But we use v5.6.3 at IN2P3-CC, which is newer than 4.9.

The CMS SAM tests raise the same error:
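The difference between the two certificates can be illustrated with a simplified model of the check described in the quoted documentation. This is a sketch, not XRootD's actual code. It assumes that the name the client expects is the door's real hostname (here ccdcacli432.in2p3.fr, obtained by resolving the alias), that a pre-4.9 client compares it only with the CN, and that a newer client also accepts SAN entries. Both function names are hypothetical.

```python
def legacy_match(expected_host, cert_cn, cert_sans):
    """Model of a pre-4.9 client: only the CN of the DN is compared,
    so cert_sans is deliberately ignored."""
    return expected_host.lower() == cert_cn.lower()


def san_aware_match(expected_host, cert_cn, cert_sans):
    """Model of a 4.9+ client: the CN or any SAN entry may match."""
    host = expected_host.lower()
    return host == cert_cn.lower() or host in {s.lower() for s in cert_sans}


# Assumption: the alias resolves to this door, and that is the expected name.
door = "ccdcacli432.in2p3.fr"

own_cert = ("ccdcacli432.in2p3.fr",
            ["ccdcacli432.in2p3.fr", "ccxrootdatlas.in2p3.fr"])
shared_cert = ("ccdcacli422.in2p3.fr",
               ["ccdcacli422.in2p3.fr", "ccdcacli432.in2p3.fr",
                "ccxrootdatlas.in2p3.fr"])

print(legacy_match(door, *own_cert))        # True
print(legacy_match(door, *shared_cert))     # False
print(san_aware_match(door, *shared_cert))  # True
```

Under these assumptions the model reproduces what is reported here: the door's own certificate passes either check, while the shared certificate passes only when SAN entries are consulted.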

START TIME: Mon Dec 4 15:08:20 UTC 2023 --> running cmsRun -j fjr.xml -p analysis_test.py
04-Dec-2023 15:08:21 UTC Initiating request to open file root://ccxrootdcms.in2p3.fr:1094//pnfs/in2p3.fr/data/cms/disk/data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root
[2023-12-04 15:08:21.715468 +0000][Error ][AsyncSock ] [ccxrootdcms.in2p3.fr:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2023-12-04 15:08:21.880432 +0000][Error ][AsyncSock ] [ccxrootdcms.in2p3.fr:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2023-12-04 15:08:21.880546 +0000][Error ][PostMaster ] [ccxrootdcms.in2p3.fr:1094 #0] Unable to recover: [FATAL] Auth failed.

I ran the xrootd door in debug mode this morning with a problematic certificate, and I'm looking for hints in the log.

alrossi commented 12 months ago

When you say "I'm not able to reproduce this with a simple xrdcp command", you still mean against the dCache endpoints, correct?

If so, then this is indeed a client issue.

ageorget commented 12 months ago

Yes, a basic xrdcp from a worker node works with both certificates:

xrdcp -v -f root://ccxrootdatlas.in2p3.fr:1094//pnfs/in2p3.fr/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 /dev/null
[354.1MB/354.1MB][100%][==================================================][177.1MB/s]
ageorget commented 12 months ago

I got an answer from the CMS team:

The CMSSW/xrootd version we use for SAM and in HC testing is very old, the version current at the start of Run 2 (those are
the oldest analysis activities we still support on the grid). It is xrootd v4.6.0.
We know the version has a lot of issues. The plan is to move this forward to v4.8.5 next year but not v4.9 yet. There are also
discussions about porting old, still-in-use CMSSW versions to newer, v5 xrootd version.

SAN matching has only been supported since v4.9, so this is indeed a client issue: they are still using an xrootd client version from 2017!
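The version comparison behind this conclusion can be written down as a small check. This is a minimal sketch that assumes version strings of the form seen in this thread (such as "v4.6.0" or "4.9") and takes the 4.9 threshold from the documentation quoted earlier; the helper names are hypothetical.

```python
import re

# Threshold taken from the quoted documentation: SAN matching since v4.9.
SAN_SUPPORT_SINCE = (4, 9)


def parse_version(text):
    """Turn 'v4.6.0' or '4.9' into a tuple of ints such as (4, 6, 0)."""
    m = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if m is None:
        raise ValueError("unrecognised version string: %r" % text)
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in m.groups())


def supports_san_matching(version):
    """True if the client version is at or above the SAN threshold."""
    return parse_version(version)[:2] >= SAN_SUPPORT_SINCE


for v in ("v4.6.0", "v4.8.5", "v4.9", "v5.6.3"):
    print(v, supports_san_matching(v))
```

With this check, both the client currently used by the CMS tests (v4.6.0) and the planned v4.8.5 fall below the threshold, while the v5.6.3 client at IN2P3-CC is above it.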

Closing this ticket. Cheers