DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Onboard DURHAM as a DUNE site #38

Closed StevenCTimm closed 7 months ago

StevenCTimm commented 1 year ago

BOUTCHER, ADAM J. adam.j.boutcher@durham.ac.uk

To: Steven C Timm
Cc: CLARK, PAUL W.J. paul.w.clark@durham.ac.uk
Mon 1/30/2023 9:12 AM

Hi Steve,

I've been tasked with trying to support DUNE here at UKI-SCOTGRID-DURHAM.

I've (hopefully) set up the VO to run on our CE3 and CE4 ARC CEs. If there's anything specific that you require, please let me know. And if you could test out some submissions (if you wish to use us), I'd be really grateful for feedback.

Thanks,
Adam

--
Adam Boutcher
Senior Technical Manager (IPPP Computing Service)
Institute for Particle Physics Phenomenology (IPPP)
OC216, Ogden Centre (East)
Durham University
South Road
Durham DH1 3LE, United Kingdom

Tel: +44 (0)191 33 43527

Email: adam.j.boutcher@durham.ac.uk

StevenCTimm commented 1 year ago

I've responded to Adam asking for the host names of their CEs. Once that is in hand, we can open a ticket with the OSG to add them to the factory.

StevenCTimm commented 1 year ago

Hi,

No worries.

ce3.dur.scotgrid.ac.uk and ce4.dur.scotgrid.ac.uk

We're an ARC CE site.

Let me know if you need any more details.

Adam

kherner commented 1 year ago

Looks like Durham doesn't have any existing entries for another VO in the factories, so this will be a new entry. In addition to the CE names, which we already have, we need

1) Number of cores and total memory for the glideins.
2) The RSL string needed to get to the proper queues with the requested resources.
3) Names for GLIDEIN_Site, GLIDEIN_DUNESite, and GLIDEIN_ResourceName. For the last one we usually try to match the EGI name, if it exists. In this case it would be UKI-SCOTGRID-DURHAM, so I would just stick with that for GLIDEIN_ResourceName. For the first two, how about simply UK_Durham? Or UK_DURHAM if all caps is preferred for some reason.

kherner commented 1 year ago

Also, you support Singularity/Apptainer, right? And we need the usual CVMFS repositories available. At minimum they are

dune.opensciencegrid.org
fermilab.opensciencegrid.org
fifeuser1.opensciencegrid.org
fifeuser2.opensciencegrid.org
fifeuser3.opensciencegrid.org
fifeuser4.opensciencegrid.org
larsoft.opensciencegrid.org
singularity.opensciencegrid.org # default location for containers
dune.osgstorage.org # This is a "StashCache" repository needed for some MC generation workflows
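For a site admin who wants to confirm the repositories above are reachable from a worker node, a minimal sketch follows. It assumes CVMFS is mounted under `/cvmfs` (the usual convention, but not stated in this thread); the function name `missing_repos` is made up for illustration.

```python
import os

# Repository names copied from the list above.
REQUIRED_REPOS = [
    "dune.opensciencegrid.org",
    "fermilab.opensciencegrid.org",
    "fifeuser1.opensciencegrid.org",
    "fifeuser2.opensciencegrid.org",
    "fifeuser3.opensciencegrid.org",
    "fifeuser4.opensciencegrid.org",
    "larsoft.opensciencegrid.org",
    "singularity.opensciencegrid.org",
    "dune.osgstorage.org",
]


def missing_repos(repos=REQUIRED_REPOS, cvmfs_root="/cvmfs"):
    """Return the repositories that cannot be listed under cvmfs_root.

    cvmfs_root="/cvmfs" is an assumption about the worker node layout.
    """
    missing = []
    for repo in repos:
        path = os.path.join(cvmfs_root, repo)
        try:
            # Listing the directory (rather than only testing that it exists)
            # gives an automounter, if one is configured, a chance to mount it.
            os.listdir(path)
        except OSError:
            missing.append(repo)
    return missing


if __name__ == "__main__":
    for repo in missing_repos():
        print("MISSING:", repo)
```

Running the script on a worker node prints one line per repository that could not be listed, and prints nothing if all of them are available.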

Andrew-McNab-UK commented 1 year ago

I think UK_Durham for the DUNE site name since it’s a place not an abbreviation.

adamboutcher commented 1 year ago

So these are what I proposed, based on ECDF. However, we're not too picky, so change them as you need.

Glidein Site name = UK_SGridDurham
Glidein DUNE Site name = UK_Durham
Glidein Resource name = UKI-SCOTGRID-DURHAM (based on the EGI/GOCDB naming)

Number of cores per glidein = 1, 2, 4 or 8 (we'd prefer 8, but whatever works best for you really)
Max memory per glidein = 2 GB per core
Max run time of the glideins = 2 days
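To make the numbers above concrete, here is a small sketch of the arithmetic for one glidein. It takes "2 GB per core" as 2048 MB and expresses the two-day limit in minutes; both unit choices, and the name `glidein_request`, are assumptions for illustration only.

```python
ALLOWED_CORES = (1, 2, 4, 8)      # core counts offered by the site
MEMORY_PER_CORE_MB = 2 * 1024     # "2 GB per core", assumed to mean 2048 MB
MAX_WALLTIME_MIN = 2 * 24 * 60    # "2 days" expressed in minutes


def glidein_request(cores=8):
    """Return the total resources one glidein would ask for."""
    if cores not in ALLOWED_CORES:
        raise ValueError(f"cores must be one of {ALLOWED_CORES}, got {cores}")
    return {
        "cores": cores,
        "memory_mb": cores * MEMORY_PER_CORE_MB,
        "walltime_min": MAX_WALLTIME_MIN,
    }


if __name__ == "__main__":
    print(glidein_request(8))
```

With the site's preferred 8 cores this works out to 16384 MB of memory and 2880 minutes of run time per glidein.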

kherner commented 1 year ago

OK, thanks! I'll start a ticket with the OSG factory ops to get a new entry set up.

kherner commented 1 year ago

Ah, sorry, I also forgot to ask what the proper queue names are for the pilots; that might be needed to get the RSL strings right for the ARC CE.

adamboutcher commented 1 year ago

We just have a single queue per CE.

ce3.dur.scotgrid.ac.uk - queue name ce3
ce4.dur.scotgrid.ac.uk - queue name ce4
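The actual RSL string used in the factory entry is not given in this thread. As a rough sketch of how the queue names above could feed into one, the snippet below builds an xRSL-style fragment per CE. The attribute names (`queue`, `count`, `memory`, `walltime`) and their units (memory in MB per core, walltime in minutes) follow my understanding of ARC xRSL and should be checked against the ARC documentation; the helper `rsl_for` is hypothetical.

```python
# CE host name -> queue name, as given in the comment above.
CE_QUEUES = {
    "ce3.dur.scotgrid.ac.uk": "ce3",
    "ce4.dur.scotgrid.ac.uk": "ce4",
}


def rsl_for(ce_host, cores=8, memory_per_core_mb=2048, walltime_min=2880):
    """Build an xRSL-style fragment for one CE.

    Assumption: attribute names and units follow ARC xRSL. This is not
    the string the OSG factory actually uses for Durham.
    """
    queue = CE_QUEUES[ce_host]  # raises KeyError for an unknown CE
    return (
        f"(queue={queue})"
        f"(count={cores})"
        f"(memory={memory_per_core_mb})"
        f"(walltime={walltime_min})"
    )


if __name__ == "__main__":
    for host in CE_QUEUES:
        print(host, rsl_for(host))
```

Keeping the CE-to-queue mapping in one dictionary means a second entry for ce4 differs from the ce3 entry only in the host name passed in.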

StevenCTimm commented 1 year ago

https://support.opensciencegrid.org/support/tickets/public/09c832812acfca85656562fb832cb5fd1522af7279b24dda09e21d727d79c3e5

That's the URL to the OSG ticket to get DURHAM into the factory

StevenCTimm commented 1 year ago

Factory reports that glideins are now starting at Durham.

StevenCTimm commented 1 year ago

Analysis shows that glideins are failing because they can't find singularity/apptainer.

adamboutcher commented 1 year ago

We don't provide a local singularity, we recommend utilising one from CVMFS.

StevenCTimm commented 1 year ago

Discussed in the ops meeting. GlideinWMS should be smart enough to find apptainer in CVMFS, but apparently it is not doing so. We will have to investigate the factory log.
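For debugging on a Durham worker node, the lookup described above can be approximated by hand: try the local PATH first, then fall back to a copy in CVMFS. The sketch below is not the GlideinWMS logic, and the CVMFS path in `CVMFS_CANDIDATES` is an assumption that does not come from this thread; substitute whatever location the site or factory actually uses.

```python
import os
import shutil

# Assumed CVMFS location of an apptainer binary; not stated in this thread.
CVMFS_CANDIDATES = (
    "/cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer",
)


def find_container_runtime(names=("apptainer", "singularity"),
                           cvmfs_candidates=CVMFS_CANDIDATES):
    """Return the path of the first usable runtime, or None if none is found."""
    # 1. A locally installed binary on PATH (Durham does not provide one).
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    # 2. Fall back to executables at the candidate CVMFS paths.
    for candidate in cvmfs_candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_container_runtime() or "no apptainer/singularity found")
```

If this prints a CVMFS path on the worker node while the glideins still fail, the problem is more likely in how the pilot searches than in what the site provides.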

StevenCTimm commented 7 months ago

It worked briefly but then broke again. This is tracked in #101.