DIRACGrid / DIRAC


[hackathon] Only include resources that are actually able to accept jobs on DIRAC certification instance #7658

Open marianne013 opened 2 weeks ago

marianne013 commented 2 weeks ago

The DIRAC certification instance seems to contain a number of resources that haven't worked in months (years?), making it difficult to distinguish real DIRAC errors from site failures.

So far we have:

Sites/CEs:

LCG htcondor-ce-[1,2,3,4]-kit.gridka.de_condor: submission fails with:

```
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:Command ['condor_submit', '-terse', '-pool', 'htcondor-ce-3-kit.gridka.de:9619', '-remote', 'htcondor-ce-3-kit.gridka.de', '/opt/dirac/data/HTCondor/work/HTCondorCE_axlxafve.sub'] failed with: 1 - ERROR: Failed to connect to queue manager htcondor-ce-3-kit.gridka.de
```

According to their BDII, these Condor CEs still support dteam. Should this be followed up, or the resources deleted? If kept, note that they probably don't run EL7 any longer either.
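Two quick manual checks can separate a site or network problem from a DIRAC one; a sketch, using the host from the log above and the standard top-level BDII (the exact CE to probe is illustrative):

```sh
# 1) Can we authenticate to the CE's schedd at all? WRITE is the
#    authorization level a submission needs; failure here points at the
#    site or the network path, not at DIRAC.
condor_ping -verbose -pool htcondor-ce-3-kit.gridka.de:9619 \
    -name htcondor-ce-3-kit.gridka.de WRITE

# 2) What does the (GLUE1) BDII actually publish as supported VOs for this CE?
ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b o=grid \
    '(&(objectClass=GlueCE)(GlueCEUniqueID=htcondor-ce-3-kit.gridka.de*))' \
    GlueCEAccessControlBaseRule
```

If the first command fails while the second still advertises VO:dteam, the BDII information is simply stale.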

There is a test that explicitly targets GRIF, which hasn't worked in over a year:

```
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:52:39,588024Z WorkloadManagement/SiteDirectorDteam/node16.datagrid.cea.fr ERROR: Failed getting the status of the CE. Response: 403 - User can't be assigned configuration
```

Ask the site, or retire the resource?

CERN:

```
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:41:24,161662Z WorkloadManagement/SiteDirectorDteam/WorkloadManagement/SiteDirectorDteam ERROR: The following errors occurred during the pilot submission operation Command ['condor_submit', '-terse', '-pool', 'ce504.cern.ch:9619', '-remote', 'ce504.cern.ch', '/opt/dirac/data/HTCondor/work/HTCondorCE_no8_ygwr.sub'] failed with: 1 - ERROR: Failed to connect to queue manager ce504.cern.ch
```

I think giving up on CERN is a bad idea.

LCG.NCBJ.pl: One HelloWorld job recently succeeded, but the rest fail. Possibly related to the storage elements?
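A minimal probe job pinned to the site, with no input or output data involved, would help separate the two; a sketch using the DIRAC JDL CLI (file name and job parameters are illustrative):

```sh
cat > hello-ncbj.jdl <<'EOF'
JobName       = "certification-hello";
Executable    = "/bin/echo";
Arguments     = "Hello from the certification instance";
StdOutput     = "StdOut";
StdError      = "StdErr";
OutputSandbox = {"StdOut","StdErr"};
Site          = "LCG.NCBJ.pl";
EOF
dirac-wms-job-submit hello-ncbj.jdl   # prints a JobID to follow with dirac-wms-job-status
```

If this consistently succeeds while the standard tests fail, the storage-element theory gains weight.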

LCG.RAL.uk: Currently doesn't work due to #7657, but should work in principle.

Imperial, Glasgow, RALPPD: Should all work, and can be fixed if they don't. The Imperial cloud is currently (07/06/24) broken, but Simon and I are on it. Everything else should work.

Storage Elements:

dcache.du.cesnet.cz (CESNET-SE) does not exist; at least I couldn't find it in the GOCDB. Remove it from the config?

IN2P3-SE (ccsrm.in2p3.fr) does exist, but needs verifying to see whether it still takes dteam data (see the probe sketched after this list).

RAL-SE does exist, but does it still take dteam?

UKI-LT2-IC-HEP-disk and UKI-SOUTHGRID-RALPP-disk should work; if they don't, the sites should be told.

We could probably rope Glasgow in if a further storage element is needed.
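A write/delete probe run with a dteam proxy would settle the "does it still take dteam" questions; a sketch (the LFN path is illustrative, pick one under the dteam namespace):

```sh
# What does the DIRAC configuration currently think of the SE?
dirac-dms-show-se-status IN2P3-SE

# Upload a throwaway file, then clean up. If the upload fails, the SE no
# longer takes dteam data (or the CS entry for it is stale).
LFN="/dteam/certification/se-probe-$(date +%s).txt"
echo "se probe" > /tmp/se-probe.txt
dirac-dms-add-file "$LFN" /tmp/se-probe.txt IN2P3-SE
dirac-dms-remove-files "$LFN"
```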

Once we have a set of sites/storage elements that should work, we can remove the obsolete ones from the tests. I think this ticket should be a group effort :-D.
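In the meantime, resources that are known to be broken can be taken out of the site mask so they stop drowning out real DIRAC errors; a sketch with the admin CLI (the site name and comments are illustrative):

```sh
dirac-admin-get-site-mask                 # inspect the current mask
dirac-admin-ban-site LCG.GRIDKA.de "CEs unreachable, see #7658"
# ...and re-enable once the site confirms a fix:
dirac-admin-allow-site LCG.GRIDKA.de "CEs reachable again"
```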

marianne013 commented 2 weeks ago

Also, RAL: currently the OS is defined on the CE, but for RAL this does not make sense; it needs to hang off the queue. We use a slightly different approach ("Platform"; no-one but Simon understands this), so each queue is either EL7, EL8, or EL9 and the OS is None. This seems to work fine, but how do I define this on the certification machine?
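In CS syntax that would look roughly like the sketch below; the host and queue names are placeholders, and only the per-queue Platform values reflect what RAL actually does:

```
Resources
{
  Sites
  {
    LCG
    {
      LCG.RAL.uk
      {
        CEs
        {
          some-ce.gridpp.rl.ac.uk
          {
            # no OS option at the CE level
            Queues
            {
              el7-queue
              {
                Platform = EL7
              }
              el9-queue
              {
                Platform = EL9
              }
            }
          }
        }
      }
    }
  }
}
```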

fstagni commented 2 weeks ago

Good points. But I am not sure that the BDII provides reliable information. What does GOCDB say?

Given that we need to move to the new DIRAC certification instance, and that this is something we should finalize as a group effort in the workshop's hackathon, we can redefine the list of resources in the new setup.

marianne013 commented 2 weeks ago

GOCDB does not give information as to what VOs are supported. We've been pointing this out for quite a while. In the end we will have to email the sites.
I'm currently trying to convince Glasgow to give us another SE for testing.
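For completeness: the public GOCDB programmatic interface does at least confirm whether a host is registered at all, even if it won't say which VOs it supports; a sketch, using the SE questioned above:

```sh
# An empty result means GOCDB has never heard of the host, which is what
# we suspect for dcache.du.cesnet.cz (CESNET-SE).
curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&hostname=dcache.du.cesnet.cz'
```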