DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

UK_RAL-Tier1 failing JustIN writes back to Fermilab only. #62

Closed StevenCTimm closed 1 year ago

StevenCTimm commented 1 year ago

Jobsub admins have blocked all user jobs (including justIN) from running at RAL Tier1 because of ongoing asymmetric routing network problems. These are expected to be fixed next week once Fermilab public dCache moves completely onto LHCONE, scheduled for Apr 19. Won't file any tickets until we hear from Phil DeMar.

StevenCTimm commented 1 year ago

According to discussion at the operations meeting on 4/17 this has been fixed, and test jobs were successful.

Will leave this open until we start seeing JustIN jobs go through again.

StevenCTimm commented 1 year ago

The jobs are going through again, but they are still having trouble writing to FNAL dCache from RAL Tier1.

StevenCTimm commented 1 year ago

Robert Illingworth asked us to open a SNOW ticket on this so the Fermilab end is informed.

StevenCTimm commented 1 year ago

The error seems to have gone away on its own. JustIN can write back to Fermilab now.

Will keep watching to make sure it doesn't recur in short order.

StevenCTimm commented 1 year ago

Have now opened a ServiceNow ticket with Fermilab, RITM1716559.

That is at Robert Illingworth's request. Failures are still intermittent.

It was thought that uboone might be hogging all the network bandwidth, but they've been blacklisted at this site since Apr 21 and the network errors still continue.

StevenCTimm commented 1 year ago

This behavior has stopped now. Don't know why.