cvmfs / cvmfs

The CernVM File System
http://cernvm.cern.ch/portal/filesystem
BSD 3-Clause "New" or "Revised" License

hung /cvmfs mounts #3432

Open stuartthebruce opened 1 year ago

stuartthebruce commented 1 year ago

/cvmfs mounts are hung on multiple RL8.8 systems, e.g.,

[root@node538 ~]# strace df -h
execve("/usr/bin/df", ["df", "-h"], 0x7ffe597f9908 /* 41 vars */) = 0
brk(NULL)                               = 0x55efb5cb0000
...
stat("/boot", {st_mode=S_IFDIR|0555, st_size=4096, ...}) = 0
stat("/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/cvmfs/ligo.storage.igwn.org", ^C^C^C^Z
[1]+  Stopped                 strace df -h
[root@node538 ~]# kill -9 %1

[1]+  Stopped                 strace df -h
[root@node538 ~]# pgrep -c -f "umount /cvmfs/ligo.storage.igwn.org"
1795

which persist even after running,

[root@node538 ~]# cvmfs_config killall
Terminating cvmfs_config processes... OK
Terminating cvmfs2 processes... OK
Unmounting stale mount points... Killed
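Because `df` and `stat` block on the hung mount point, any health check that touches /cvmfs directly can get stuck as well. A minimal sketch of a probe that never waits on the blocked call is shown here; the function name `check_mount`, its return codes, and the default 5 second limit are illustrative assumptions, not part of cvmfs.

```shell
#!/usr/bin/env bash
# check_mount PATH [TIMEOUT_SECONDS]   (hypothetical helper, not a cvmfs tool)
# Returns 0 if stat succeeded, 1 if stat failed, and 2 if stat was still
# blocked when the timeout expired (the mount is probably hung).
check_mount() {
    local path=$1 limit=${2:-5} pid ticks=0
    # Run the probe in the background so the caller can never block on it.
    stat -- "$path" >/dev/null 2>&1 &
    pid=$!
    # Poll ten times per second until the probe exits or the limit is reached.
    while [ "$ticks" -lt $((limit * 10)) ]; do
        if ! kill -0 "$pid" 2>/dev/null; then
            # The probe has exited; collect its status.
            wait "$pid" && return 0
            return 1
        fi
        sleep 0.1
        ticks=$((ticks + 1))
    done
    # Best effort only: a process wedged in the kernel may ignore SIGKILL,
    # so send the signal but do not wait for the process.
    kill -9 "$pid" 2>/dev/null
    return 2
}
```

For example, `check_mount /cvmfs/ligo.storage.igwn.org 10` would report status 2 on a node in the state shown above instead of hanging the shell.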

Please find attached the output of cvmfs_config bugreport, but note that at several points I needed to kill hung processes, e.g., /bin/df, for the report generation to continue,

[root@node538 ~]# cvmfs_config bugreport
Gathering /etc/cvmfs
Gathering files in quarantaine
Gathering stack traces
Gathering uname -a
Gathering cat /etc/issue
Gathering hostname -f
Gathering ifconfig -a
Gathering cvmfs2 --version
Gathering ls -lR /var/run/cvmfs
Gathering grep cvmfs /var/log/messages
Gathering grep cvmfs /var/log/syslog
Gathering find /usr/lib /usr/lib64 /lib /lib64 -name libfuse*
Gathering journalctl -alm /usr/bin/cvmfs2
Gathering journalctl -alm /usr/libexec/cvmfs/cache/cvmfs_cache_ram
Gathering eval find /var/lib/cvmfs -maxdepth 1 -exec ls -lah \{\} \;
Gathering cvmfs_config probe
Gathering mount
Gathering df -h
/usr/bin/cvmfs_config: line 1489: 3920671 Terminated              $cmd >> $out 2> $err
Gathering ps -ef
Gathering cvmfs_config status
/usr/bin/cvmfs_config: line 1489: 3920699 Terminated              $cmd >> $out 2> $err
Gathering cvmfs_config showconfig
Gathering cvmfs_config chksetup
Gathering cvmfs_config stat -v
/usr/bin/cvmfs_config: line 1489: 3922197 Terminated              $cmd >> $out 2> $err
Gathering cvmfs_talk internal affairs
Gathering cat /etc/fuse.conf
Gathering ls -la /usr/bin/fusermount
Gathering ls -la /bin/fusermount
Gathering cat /etc/auto.master
Gathering cat /etc/autofs/auto.master
Gathering cat /etc/sysconfig/autofs
Gathering cat /etc/default/autofs
Gathering cat /etc/conf.d/autofs
Gathering journalctl -almu autofs.service
Gathering cat /etc/fstab
Gathering cat /etc/exports
Gathering cat /proc/mounts
Gathering cat /proc/cpuinfo
Gathering cat /etc/hosts
Gathering free -m
Gathering systemctl show autofs.service
Gathering id cvmfs
Gathering ls -la /etc/auto.master.d/*
/usr/bin/cvmfs_config: line 1494: lslaetcautomasterdarchiveautofsetcautomasterdcephautofsetcautomasterdcudaautofsetcautomasterdcvmfsautofsetcautomasterddirectautofsetcautomasterdhomeautofsetcautomasterdifocacheautofsetcautomasterdldcgautofsetcautomasterdmdchomeautofsetcautomasterdoneapiautofsetcautomasterdscratchautofs.stdout: File name too long
/usr/bin/cvmfs_config: line 1495: lslaetcautomasterdarchiveautofsetcautomasterdcephautofsetcautomasterdcudaautofsetcautomasterdcvmfsautofsetcautomasterddirectautofsetcautomasterdhomeautofsetcautomasterdifocacheautofsetcautomasterdldcgautofsetcautomasterdmdchomeautofsetcautomasterdoneapiautofsetcautomasterdscratchautofs.stdout: File name too long
Gathering cat /etc/auto.master.d/*
/usr/bin/cvmfs_config: line 1494: catetcautomasterdarchiveautofsetcautomasterdcephautofsetcautomasterdcudaautofsetcautomasterdcvmfsautofsetcautomasterddirectautofsetcautomasterdhomeautofsetcautomasterdifocacheautofsetcautomasterdldcgautofsetcautomasterdmdchomeautofsetcautomasterdoneapiautofsetcautomasterdscratchautofs.stdout: File name too long
/usr/bin/cvmfs_config: line 1495: catetcautomasterdarchiveautofsetcautomasterdcephautofsetcautomasterdcudaautofsetcautomasterdcvmfsautofsetcautomasterddirectautofsetcautomasterdhomeautofsetcautomasterdifocacheautofsetcautomasterdldcgautofsetcautomasterdmdchomeautofsetcautomasterdoneapiautofsetcautomasterdscratchautofs.stdout: File name too long

System information has been collected in /tmp/cvmfs-bugreport.fEMrMB/cvmfs-bugreport.tar.gz
Please attach this file to your problem description and send it as a
bug report to https://github.com/cvmfs/cvmfs/issues

cvmfs-bugreport.tar.gz

DrDaveD commented 1 year ago

Thank you, you did the right thing in reporting this the way you did. A lot of people have had this trouble and so far we haven't had any success with reproducing it or tracking down the root cause. The overall tracking issue is #3378.

This hang was apparently on October 25. The messages are no longer in grepcvmfsvarlogmessages.stdout but are in journalctlalmuautofsservice.stdout. Here are the last messages for that repo:

Oct 25 07:03:55 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switched to catalog revision 9606
Oct 25 07:17:12 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:17:12 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:17:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:17:33 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf (hash: 4575229e452c097a347ae794b53f7c33483003f2, error 15 [host serving data too slowly])
Oct 25 07:17:33 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf
Oct 25 07:17:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:18:16 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:18:16 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:18:26 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:18:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf (hash: 7d55d5e8896267925caeffe58b59075c76e99ac6, error 15 [host serving data too slowly])
Oct 25 07:18:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf
Oct 25 07:18:49 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:19:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:19:02 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:19:13 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: e26379a88ae44df48e09cff7fb028e3adc320c10, error 15 [host serving data too slowly])
Oct 25 07:19:13 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:19:14 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:19:24 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:21:11 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host data transfer cut short)
Oct 25 07:21:11 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:21:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:21:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371365376-4096.gwf (hash: 9c033baf74a053e2569c30795b038ff1b04f2143, error 15 [host serving data too slowly])
Oct 25 07:21:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371365376-4096.gwf
Oct 25 07:21:40 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:21:50 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:21:51 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:22:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: e2cc8e52e65247de9ccb95cb0d28f10e067d0bca, error 15 [host serving data too slowly])
Oct 25 07:22:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:22:09 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:22:20 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:23:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:23:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:23:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:23:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: b660415ce1fad446259bb2cf7dd7278561058055, error 15 [host serving data too slowly])
Oct 25 07:23:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:23:46 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:24:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:24:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:24:31 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: b660415ce1fad446259bb2cf7dd7278561058055, error 15 [host serving data too slowly])
Oct 25 07:24:31 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:24:35 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:24:47 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:24:47 node538 cvmfs2[650307]: (ligo.storage.igwn.org) No auth token found in returned JSON from Authz helper /cvmfs/config-osg.opensciencegrid.org/libexec/authz/cvmfs_scitoken_helper

This is similar to the messages seen in other cases, although I think this is the first case of an OSG site; I think the others were all in Europe. There was one case that was an unauthenticated repository so we have been trying to reproduce it with one of those because it's simpler, but most of them have been a LIGO repository so we probably need to expand to trying to reproduce it with authentication. #3375 also had a message about "No auth token found", maybe that's important.
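To compare the failover pattern across reports, it can help to tally the reasons cvmfs2 logs for each host switch. The sketch below assumes the log format shown in the excerpt above ("switching host from A to B (reason)"); the function name `summarize_switches` is hypothetical.

```shell
#!/usr/bin/env bash
# summarize_switches: read syslog or journal text on stdin and print a count
# of each "switching host" reason, most frequent first.
summarize_switches() {
    # Keep only the parenthesized reason at the end of each matching line.
    sed -n 's/.*switching host from [^ ]* to [^ ]* (\(.*\))$/\1/p' |
        sort | uniq -c | sort -rn
}

# Usage (the repository name is taken from this report):
#   journalctl -al /usr/bin/cvmfs2 | grep ligo.storage.igwn.org | summarize_switches
```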

Is there any way you could give temporary access to one of these hung machines to someone on the cvmfs development team, or me?

If you have more than one it would be interesting to see if you can kill the relevant cvmfs2 process and get the machine unhung. Other people have reported that this problem can only be cleared up with a reboot.

josh-willis commented 1 year ago

I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California. That surely didn't help, though I don't claim it's the real cause.

DrDaveD commented 1 year ago

> I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California.

Please file an OSG ticket about that. It's probably a Maxmind database issue, which Fabio knows how to fix.

stuartthebruce commented 1 year ago

> > I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California.

> Please file an OSG ticket about that. It's probably a Maxmind database issue, which Fabio knows how to fix.

Multiple attempts have been made to get MaxMind to update their GeoIP database.

stuartthebruce commented 1 year ago

> Is there any way you could give temporary access to one of these hung machines to someone on the cvmfs development team, or me?

Yes, and I will move that discussion to email.

> If you have more than one it would be interesting to see if you can kill the relevant cvmfs2 process and get the machine unhung. Other people have reported that this problem can only be cleared up with a reboot.

We had several and were not able to unwedge them with kill -9.
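Processes that survive `kill -9` are usually either blocked inside the kernel (state D) or already defunct (state Z), and neither can act on the signal. A small sketch for listing them from `ps` output follows; the function name `unkillable_procs` is hypothetical, and the filter assumes the column order `pid,stat,args`.

```shell
#!/usr/bin/env bash
# unkillable_procs: filter `ps -eo pid,stat,args` output (on stdin) down to
# processes in uninterruptible sleep (D) or zombie (Z) state.
unkillable_procs() {
    # Skip the header line, then match on the first letter of the STAT column.
    awk 'NR > 1 && $2 ~ /^[DZ]/'
}

# Usage:
#   ps -eo pid,stat,args | unkillable_procs | grep -E 'cvmfs|umount'
```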

vvolkl commented 1 year ago

@stuartthebruce Great! Please do include me as well.

DrDaveD commented 1 year ago

> Multiple attempts have been made to get MaxMind to update their GeoIP database.

This is the first time I have ever heard that a request to update the MaxMind DB had a problem. In my experience they have always applied a requested change to their next update in their 2 week cycle. I would like to know the details of who tried what when.

stuartthebruce commented 1 year ago

> > Multiple attempts have been made to get MaxMind to update their GeoIP database.

> This is the first time I have ever heard that a request to update the MaxMind DB had a problem. In my experience they have always applied a requested change to their next update in their 2 week cycle. I would like to know the details of who tried what when.

The CIT HEP group requested ultralight.org be relocated from Kansas to Pasadena starting on Oct 2.

josh-willis commented 1 year ago

And I believe that they did that through this link: https://www.maxmind.com/en/geoip-data-correction-request, as we were instructed to do.

DrDaveD commented 1 year ago

Let's take the geoIP issue out of this ticket because it's not related. I'm on an email thread with Stuart about it, I'll add Josh to it.

DrDaveD commented 1 year ago

I investigated one of Stuart's hung nodes today. Unfortunately they have a cleanup process in place that mostly succeeded. They are going to disable that for future investigations.

There were a lot of hung umount /cvmfs/ligo.storage.igwn.org processes running from their cleanup process. There were no functioning cvmfs2 processes, just one that was defunct even though it was a child of pid 1. I was not able to get rid of it even by doing systemctl daemon-reexec. I did however find a way to make it irrelevant, by removing all the files in /var/run/cvmfs/* and /var/lib/cvmfs/osgstorage/shared/*ligo.storage*. Now the system is behaving normally without a reboot. It just has that defunct cvmfs2 process and its child, a defunct cvmfs_scitoken_helper which I killed after finding out it was just waiting on a read from stdin.
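The file removal described above can be sketched as a helper that defaults to a dry run. This is only an illustration of the workaround as described, not a supported cvmfs procedure: the function name `cleanup_stale_cvmfs`, the `--apply` switch, and the `CVMFS_RUN_DIR` / `CVMFS_SHARED_DIR` override variables are all hypothetical; only the two default paths come from the comment above.

```shell
#!/usr/bin/env bash
# cleanup_stale_cvmfs PATTERN [--apply]   (hypothetical helper)
# Lists, or with --apply removes, everything under the cvmfs run directory
# plus the shared-cache entries whose names contain PATTERN.
cleanup_stale_cvmfs() {
    local pattern=$1 mode=${2:-dry-run} f
    # Override variables are assumptions, added so the sketch can be tried
    # against a scratch directory first.
    local run_dir=${CVMFS_RUN_DIR:-/var/run/cvmfs}
    local shared_dir=${CVMFS_SHARED_DIR:-/var/lib/cvmfs/osgstorage/shared}
    # Refuse an empty pattern, which would match the whole shared cache.
    [ -n "$pattern" ] || { echo "usage: cleanup_stale_cvmfs PATTERN [--apply]" >&2; return 2; }
    for f in "$run_dir"/* "$shared_dir"/*"$pattern"*; do
        # Skip unmatched globs, which the shell leaves as literal strings.
        [ -e "$f" ] || continue
        if [ "$mode" = "--apply" ]; then
            rm -rf -- "$f"
        else
            echo "would remove: $f"
        fi
    done
    return 0
}
```

Running `cleanup_stale_cvmfs ligo.storage` only prints the candidates; adding `--apply` performs the removal.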

The last log messages related to ligo.storage.igwn.org were similar to those reported for other cases:

Nov 21 18:29:54 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switched to catalog revision 12154
Nov 21 18:31:50 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:31:50 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:32:01 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Nov 21 18:33:32 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 2b19d47abf2fd31cc460c8790d35a914180fa251, error 17 [host data transfer cut short])
Nov 21 18:33:32 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 18:48:43 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host data transfer cut short)
Nov 21 18:48:54 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 31fdc73d412317eddb691e675c946c1cf69634e7, error 17 [host data transfer cut short])
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 18:48:56 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 18:50:56 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switched to catalog revision 12155
Nov 21 18:53:25 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 18:54:06 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:54:06 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:54:07 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 19:03:00 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 19:03:12 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 19:03:12 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 20f0cf8f62a538100edd1ef36e8d217182bb6724, error 17 [host data transfer cut short])
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 19:06:24 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 19:06:24 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) Authorization for session 801203 disappeared

stuartthebruce commented 1 year ago

The LIGO CIT Condor pool now has all of its EP running with the following monit script temporarily removed,

[root@node507 ~]# cat /etc/monit.d/cvmfs.cfg 
# Refresh hung CVMFS mounts /cvmfs
#check program check_hung_cvmfs with path "/usr/bin/ls -d /cvmfs/singularity.opensciencegrid.org"
#   if status != 0 then restart
#   stop program = "/usr/bin/cvmfs_config wipecache"
#   start program = "/usr/bin/sleep 59" with timeout 60 seconds
# Refresh hung CVMFS mounts /cvmfs
check program check_hung_cvmfs with path "/usr/bin/cvmfs_config status"
  if status != 0 then restart
    stop program = "/usr/bin/cvmfs_config killall"
    start program = "/usr/bin/cvmfs_config probe"