No jobs submitted to global pool by JustIN for last day

DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.


No jobs submitted to global pool by JustIN for last day #177

Open StevenCTimm opened 1 month ago

StevenCTimm commented 1 month ago

A user reports in Slack that JustIN isn't running anything, and I confirm that AWT jobs are not running either. condor_status -schedd shows that neither of the justin-prod-sched machines is currently talking to the collector dunegpcoll01.
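The check described above could be scripted roughly as follows. This is only a sketch: the pool address passed to `advertised_schedds` and the `EXPECTED` host names are assumptions (only the sched02 name appears in the logs in this thread; the sched01 name is guessed by analogy), and it assumes the schedd `Name` attribute is either the host name or `something@host`.

```python
import subprocess

# Assumed fully qualified schedd host names; adjust to the real ones.
EXPECTED = [
    "justin-prod-sched01.dune.hep.ac.uk",  # assumption, by analogy with sched02
    "justin-prod-sched02.dune.hep.ac.uk",
]

def parse_names(output):
    """Turn `condor_status -autoformat Name` output into a set of names."""
    return {line.strip() for line in output.splitlines() if line.strip()}

def advertised_schedds(pool):
    """Ask the collector at `pool` which schedds it currently knows about."""
    result = subprocess.run(
        ["condor_status", "-pool", pool, "-schedd", "-autoformat", "Name"],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return parse_names(result.stdout)

def missing_schedds(advertised, expected):
    """Return the expected hosts that no advertised schedd name matches."""
    def present(host):
        return any(n == host or n.endswith("@" + host) for n in advertised)
    return sorted(h for h in expected if not present(h))
```

Calling `missing_schedds(advertised_schedds(pool), EXPECTED)` with the collector address as `pool` would then list the schedds that have dropped out of the pool.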

StevenCTimm commented 1 month ago

I have opened an issue with Fermilab: INC1177158.

What we are seeing is that the Negotiator of the global pool cannot connect to these schedds.

07/17/24 12:49:28 attempt to connect to <130.246.81.93:9618> failed: Connection refused (connect errno = 111).
07/17/24 12:49:28 Failed to connect to dunejustin@fnal.gov (<130.246.81.93:9618?addrs=130.246.81.93-9618&alias=justin-prod-sched02.dune.hep.ac.uk&noUDP&sock=schedd_47516_9a2e>)

It is not clear at this point whether the problem is on the RAL end or the Fermilab end. Both sides should continue diagnosing. The "connection refused" error leads me to believe it is on the RAL end, but I cannot prove that.
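To help narrow down which end is at fault, either side could probe the schedd port directly. The sketch below is illustrative only (the function name `probe_tcp` is made up here). A "refused" result means something actively rejected the connection, typically because nothing is listening on the port or a firewall rule rejects it, whereas a "timeout" more often points at packets being silently dropped along the way.

```python
import socket

def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP connection attempt: 'open', 'refused', 'timeout' or 'error: ...'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # errno 111 (ECONNREFUSED) on Linux, as in the negotiator log
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on
        return f"error: {exc}"
    finally:
        sock.close()
```

For example, `probe_tcp("130.246.81.93", 9618)` run from a Fermilab host and again from a RAL host would show whether the refusal is seen from both sides.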

StevenCTimm commented 1 month ago

We lost contact with justin-prod-sched02 at 12:49 Fermilab time yesterday, and with justin-prod-sched01 at 13:48 Fermilab time yesterday.

StevenCTimm commented 1 month ago

The "connection refused" error leads me to suspect some kind of firewall setting on the justin-prod-sched01 end. Is there any way to check whether the firewalld and/or nftables settings have changed recently?

StevenCTimm commented 1 month ago

Chris B. says the problem was that the spool space had filled up on the schedds. After whatever he did, we can now see justin-prod-sched01 but still not justin-prod-sched02.

StevenCTimm commented 1 month ago

Chris follows up that the spool disks quickly filled up again.
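Since the spool filling up seems to be what takes the schedds offline, a small periodic check might catch it earlier. This is a minimal sketch, not what is actually deployed: `spool_usage` and the 90% threshold are made up for illustration, and the spool path has to be supplied (on the schedd host, `condor_config_val SPOOL` prints it).

```python
import shutil

def spool_usage(path, warn_fraction=0.9):
    """Report how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "used_fraction": used_fraction,
        # True once usage reaches the (assumed) warning threshold
        "nearly_full": used_fraction >= warn_fraction,
    }
```

Running this from cron against the spool directory and alerting when `nearly_full` is true would give some warning before the schedd stops accepting connections.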

StevenCTimm commented 1 month ago

justin-prod-sched01 has been offline all weekend, which means that no AWT jobs have been submitted or run. Some jobs did run on justin-prod-sched02, but none are pending at the moment.