DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Split US_BNL site into US_BNL-RACF and US_BNL-SDCC #96

Open StevenCTimm opened 1 year ago

StevenCTimm commented 1 year ago

Or coordinate with Doug appropriately on the right names. We are opportunistic on the RACF (ATLAS) nodes, and they allow singularity in singularity. We have a dedicated allocation on the SDCC nodes, and they don't.

Andrew-McNab-UK commented 1 year ago

Do we need to do this? It will not be needed in the next release of justIN, which bases CPU resource management on entries rather than sites.

kherner commented 1 year ago

Sounds like we just need to be sure that they have different GLIDEIN_DUNESite names.

Andrew-McNab-UK commented 1 year ago

I think we should base this on who the tickets should be sent to. If some random DUNE ops person outside the site needs to submit a ticket, does it go to the same place for both subsites or to different places? Same place = one site. Different places = two subsites?

Another point is that if the storage at the site has onsite vs offsite differences in Rucio, then the RSE's site attribute should be set to the prefix (e.g. US_FNAL or US_BNL) rather than a full subsite name (US_FNAL-FermiGrid or US_BNL-SDCC). Then I can make justIN realise this is going on and set the distance to zero for subsites that share that prefix. But sites like UK_RAL-Tier1 vs UK_RAL-PPD really are different sites with separate storage and networking, and the distance between them is non-zero.
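
A minimal sketch of the prefix rule described above, in Python. This is not justIN's actual code: the function names, the `configured_distance` argument, and the convention that a prefix is everything before the first hyphen are all assumptions made for illustration.

```python
def site_prefix(site_name: str) -> str:
    """Return the part before the first hyphen, e.g. 'US_BNL-SDCC' -> 'US_BNL'.

    Assumption: subsite names follow the PREFIX-Subsite pattern seen in this thread.
    """
    return site_name.split("-", 1)[0]


def effective_distance(job_site: str, rse_site_attr: str, configured_distance: float) -> float:
    """Hypothetical helper: pick the distance between a job's site and an RSE.

    If the RSE's site attribute is a bare prefix (no subsite part) and the job's
    site shares that prefix, treat the storage as local and return zero.
    Otherwise fall back to whatever distance is already configured.
    """
    rse_is_bare_prefix = "-" not in rse_site_attr
    if rse_is_bare_prefix and site_prefix(job_site) == rse_site_attr:
        return 0.0
    return configured_distance


# Shared storage: RSE tagged with the prefix only, so both BNL subsites see distance zero.
print(effective_distance("US_BNL-SDCC", "US_BNL", 0.5))
print(effective_distance("US_BNL-RACF", "US_BNL", 0.5))

# Separate storage: RSE tagged with a full subsite name, so the configured distance is kept.
print(effective_distance("UK_RAL-Tier1", "UK_RAL-PPD", 0.3))
```

Under this sketch, whether two subsites share storage is a property of how the RSE is tagged, so nothing beyond the site attribute needs to be maintained.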

StevenCTimm commented 1 year ago

I made the original suggestion because the two sub-sites differ both in memory and in whether they support user namespaces. With the existing unified site, justIN job scripts sometimes failed and sometimes succeeded. I am not sure whether the current Rucio config at BNL still needs a different port inside vs. outside; we have to check that.

StevenCTimm commented 1 year ago

Also, eventually we are going to want to do as CMS does and have different frontend groups for sites where we have allocations as opposed to those where we are just opportunistic. That can make sure we fill our allocated sites first.