DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

RAL Stratum 1 issue #109

Closed StevenCTimm closed 9 months ago

StevenCTimm commented 9 months ago

We have 3 sites where the fifeuser4.opensciencegrid.org repo is way behind, namely Manchester, SGrid and SGridOxford. The original report is pasted below.

Jake Calcutt 2:42 AM
Here is info about the RCDS issue I'm seeing:

Hi Jacob,

Checking the RCDS logs, we know the code tarball was properly published on the RCDS CVMFS volumes (thanks Shreyas). This job was running at the Manchester site, and checking the CVMFS revisions for the RCDS repositories, it looks like the revisions at the site are quite far behind. According to Dave D., this is due to an already known RAL stratum1 problem. I'll try to ping them to check what the status is.

In the meantime I'd suggest avoiding submitting jobs that require RCDS at the Manchester site.

-Vito

I've found this issue on the following sites: Manchester, SGrid, SGridOxford
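For anyone who wants to repeat the revision check described above, here is a minimal Python sketch. It assumes you have already fetched the `.cvmfspublished` manifest of the repository from two servers (for example the stratum 0 and the stratum 1 a site is using) and that, as in the CVMFS manifest format as I understand it, the header is a list of lines with a one-letter prefix, where `S` is the revision number and `T` is the publish timestamp, followed by a `--` line and the signature. The function names and the sample bytes are made up for illustration.

```python
def parse_manifest(data: bytes) -> dict:
    """Return the manifest header fields, keyed by their one-letter prefix.

    Parsing stops at the '--' separator, so the binary signature that
    follows it is never decoded.
    """
    fields = {}
    for raw in data.split(b"\n"):
        if raw == b"--":
            break
        if raw:
            line = raw.decode("ascii")
            fields[line[0]] = line[1:]
    return fields


def revision_lag(upstream: bytes, site: bytes) -> tuple:
    """Return (revisions behind, seconds behind) of `site` relative to `upstream`."""
    up = parse_manifest(upstream)
    st = parse_manifest(site)
    return (int(up["S"]) - int(st["S"]), int(up["T"]) - int(st["T"]))


if __name__ == "__main__":
    # Made-up manifests, for illustration only.
    upstream = b"C1234abcd\nNfifeuser4.opensciencegrid.org\nS120\nT1696000000\n--\nsig"
    site = b"C9876fedc\nNfifeuser4.opensciencegrid.org\nS100\nT1695000000\n--\nsig"
    revisions, seconds = revision_lag(upstream, site)
    print(f"site is {revisions} revisions and {seconds} s behind")
```

A lag that stays above zero for much longer than the normal replication interval is the symptom reported here.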

StevenCTimm commented 9 months ago

E-mailed Dave Dykstra to ask for more details. Is it time to put another dune-focused stratum 1 in the UK?

Steve

StevenCTimm commented 9 months ago

Dave says there's an active GGUS ticket.

Andrew-McNab-UK commented 9 months ago

There are ALICE, CMS and LHCb tickets for RAL Tier-1 about this.

Manchester is getting complaints from MicroBooNE about this too, due to their jobs that use RCDS.

Andrew-McNab-UK commented 9 months ago

RAL have just now sent this round:

“The CVMFS Stratum-1 service at RAL is experiencing problems since the 29th of September. Since then, the repositories are not being updated and, therefore, the service is distributing out of date content. This is causing a lot of failures on running jobs, as the CVMFS client is not aware of that.

We are currently investigating the root cause of the problem. Meanwhile, we have set the service offline. The hope is that, when the clients are unable to contact it, they will try with another Stratum-1 with up-to-date content.

As our Frontier-Squid most probably also has out of date content, we have disabled it as well.

As soon as the problem gets fixed, we will re-enable the services and communicate it accordingly.”
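The reasoning in the RAL message can be illustrated with a small Python sketch. It is a simplified model, not the actual CVMFS client code: it assumes the client walks an ordered list of stratum 1 servers and uses the first one that answers, without checking how fresh its content is. The host names and the `pick_server` helper are hypothetical.

```python
def pick_server(servers, is_reachable):
    """Return the first server in the ordered list that answers, or None."""
    for server in servers:
        if is_reachable(server):
            return server
    return None


if __name__ == "__main__":
    # Hypothetical host names, in the order a client might try them.
    servers = ["stale-stratum1.example", "good-stratum1.example"]

    # Stale server still online: it answers first, so clients keep using it.
    print(pick_server(servers, lambda s: True))

    # Stale server taken offline: clients fall through to the next one.
    print(pick_server(servers, lambda s: s != "stale-stratum1.example"))
```

In this model a reachable but stale server is never skipped, which is why taking the service offline is the quicker way to move clients to up-to-date content.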

Andrew-McNab-UK commented 9 months ago

RAL has updated the above tickets after fixing their stratum 1 and putting it back online. The experiments are reporting that things are ok now.