Open StevenCTimm opened 1 month ago
I have opened an issue with Fermilab, INC1177158.
What we are seeing is that the Negotiator of the global pool cannot connect to these schedds.
07/17/24 12:49:28 attempt to connect to <130.246.81.93:9618> failed: Connection refused (connect errno = 111).
07/17/24 12:49:28 Failed to connect to dunejustin@fnal.gov (<130.246.81.93:9618?addrs=130.246.81.93-9618&alias=justin-prod-sched02.dune.hep.ac.uk&noUDP&sock=schedd_47516_9a2e>)
It is not clear at this point whether the problem is on the RAL end or the Fermilab end, so both sides should continue diagnosing. The "connection refused" error leads me to believe it is on the RAL end, but I cannot prove that.
We lost contact with justin-prod-sched02 at 12:49 Fermilab time yesterday, and with justin-prod-sched01 at 13:48 Fermilab time yesterday.
The "connection refused" error leads me to suspect some type of firewall setting on the justin-prod-sched01 end. Is there any way to check whether the firewalld and/or nftables settings have changed recently?
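For anyone diagnosing from either side, here is a minimal sketch of a TCP probe that separates "refused" from "timeout". A refusal means something answered the connection attempt with a reset, which usually points to nothing listening on the port or a firewall rule that rejects. A timeout is more typical of packets being silently dropped. The function name `probe` and the 5-second default timeout are my own choices, not anything from HTCondor.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a single TCP connection attempt to host:port."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # errno 111 (ECONNREFUSED) on Linux, as in the Negotiator log.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else (DNS failure, no route to host, ...).
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"


# Example (address and port taken from the log above):
#   print(probe("130.246.81.93", 9618))
```

Running it from a Fermilab host and from a host at RAL and comparing the two results may help narrow down which end is refusing the connection.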
Chris B. says the problem was that the spool space on the schedds had filled up. After whatever he did, we can now see justin-prod-sched01 but still not justin-prod-sched02.
Chris follows up that the spool disks quickly filled up again.
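Since the spool disks keep filling up, a small check like the following could be run periodically on each schedd to catch it earlier. This is only a sketch: the spool path has to be supplied by whoever runs it (`condor_config_val SPOOL` prints the configured directory), and the names `usage_percent` and `is_nearly_full` and the 90% threshold are illustrative assumptions.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True if the filesystem holding `path` is at or above `threshold` percent."""
    return usage_percent(path) >= threshold


# Example, with the spool directory passed in by the operator:
#   print(is_nearly_full(spool_dir))
```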
justin-prod-sched01 has been offline all weekend, which means that no AWT jobs have been submitted or run. Some jobs did run on justin-prod-sched02, but none are pending at the moment.
A user reports in Slack that justIN wasn't running anything, and I confirm that AWT jobs are not running either. condor_status -schedd shows that both justin-prod-sched machines are not currently talking to the collector dunegpcoll01.