Open stuartthebruce opened 1 year ago
Thank you, you did the right thing in reporting this the way you did. A lot of people have had this trouble and so far we haven't had any success with reproducing it or tracking down the root cause. The overall tracking issue is #3378.
This hang was apparently on October 25. The messages are no longer in grepcvmfsvarlogmessages.stdout but are in journalctlalmuautofsservice.stdout. Here are the last messages for that repo:
Oct 25 07:03:55 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switched to catalog revision 9606
Oct 25 07:17:12 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:17:12 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:17:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:17:33 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf (hash: 4575229e452c097a347ae794b53f7c33483003f2, error 15 [host serving data too slowly])
Oct 25 07:17:33 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf
Oct 25 07:17:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:18:16 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:18:16 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:18:26 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:18:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf (hash: 7d55d5e8896267925caeffe58b59075c76e99ac6, error 15 [host serving data too slowly])
Oct 25 07:18:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371361280-4096.gwf
Oct 25 07:18:49 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:19:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:19:02 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:19:13 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: e26379a88ae44df48e09cff7fb028e3adc320c10, error 15 [host serving data too slowly])
Oct 25 07:19:13 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:19:14 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:19:24 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:21:11 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host data transfer cut short)
Oct 25 07:21:11 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:21:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:21:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371365376-4096.gwf (hash: 9c033baf74a053e2569c30795b038ff1b04f2143, error 15 [host serving data too slowly])
Oct 25 07:21:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1371365376-4096.gwf
Oct 25 07:21:40 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:21:50 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:21:51 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:22:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: e2cc8e52e65247de9ccb95cb0d28f10e067d0bca, error 15 [host serving data too slowly])
Oct 25 07:22:01 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:22:09 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:22:20 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:23:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:23:22 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:23:32 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:23:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: b660415ce1fad446259bb2cf7dd7278561058055, error 15 [host serving data too slowly])
Oct 25 07:23:43 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:23:46 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:24:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Oct 25 07:24:21 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Oct 25 07:24:31 node538 cvmfs2[650307]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf (hash: b660415ce1fad446259bb2cf7dd7278561058055, error 15 [host serving data too slowly])
Oct 25 07:24:31 node538 cvmfs2[650307]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00_AR/H1/H-H1_HOFT_C00_AR-137/H-H1_HOFT_C00_AR-1376440320-4096.gwf
Oct 25 07:24:35 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Oct 25 07:24:47 node538 cvmfs2[650307]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Oct 25 07:24:47 node538 cvmfs2[650307]: (ligo.storage.igwn.org) No auth token found in returned JSON from Authz helper /cvmfs/config-osg.opensciencegrid.org/libexec/authz/cvmfs_scitoken_helper
This is similar to the messages seen in other cases, although I think this is the first case of an OSG site; I think the others were all in Europe. There was one case that was an unauthenticated repository so we have been trying to reproduce it with one of those because it's simpler, but most of them have been a LIGO repository so we probably need to expand to trying to reproduce it with authentication. #3375 also had a message about "No auth token found", maybe that's important.
Is there any way you could give temporary access to one of these hung machines to someone on the cvmfs development team, or me?
If you have more than one it would be interesting to see if you can kill the relevant cvmfs2 process and get the machine unhung. Other people have reported that this problem can only be cleared up with a reboot.
I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California. That surely didn't help, though I don't claim it's the real cause.
I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California.
Please file an OSG ticket about that. It's probably a Maxmind database issue, which Fabio knows how to fix.
I'll note here that it's strange to me that this node, which is in Pasadena, was attempting to pull data from Kansas City, Denver, and Lincoln, rather than our cache in Pasadena, or others in southern California.
Please file an OSG ticket about that. It's probably a Maxmind database issue, which Fabio knows how to fix.
Multiple attempts have been made to get MaxMind to update their GeoIP database.
Is there any way you could give temporary access to one of these hung machines to someone on the cvmfs development team, or me?
Yes, and I will move that discussion to email.
If you have more than one it would be interesting to see if you can kill the relevant cvmfs2 process and get the machine unhung. Other people have reported that this problem can only be cleared up with a reboot.
We had several and were not able to unwedge them with kill -9.
@stuartthebruce Great! please do include me as well.
Multiple attempts have been made to get MaxMind to update their GeoIP database.
This is the first time I have ever heard that a request to update the MaxMind DB had a problem. In my experience they have always applied a requested change to their next update in their 2 week cycle. I would like to know the details of who tried what when.
Multiple attempts have been made to get MaxMind to update their GeoIP database.
This is the first time I have ever heard that a request to update the MaxMind DB had a problem. In my experience they have always applied a requested change to their next update in their 2 week cycle. I would like to know the details of who tried what when.
The CIT HEP group requested ultralight.org be relocated from Kansas to Pasadena starting on Oct 2.
And I believe that they did that through this link: https://www.maxmind.com/en/geoip-data-correction-request, as we were instructed to do.
Let's take the geoIP issue out of this ticket because it's not related. I'm on an email thread with Stuart about it, I'll add Josh to it.
I investigated one of Stuart's hung nodes today. Unfortunately they have a cleanup process in place that mostly succeeded. They are going to disable that for future investigations.
There were a lot of hung umount /cvmfs/ligo.storage.igwn.org
processes running from their cleanup process. There were no functioning cvmfs2 processes, just one that was defunct even though it was a child of pid 1. I was not able to get rid of it even by doing systemctl daemon-reexec
. I did however find a way to make it irrelevant, by removing all the files in /var/run/cvmfs/*
and /var/lib/cvmfs/osgstorage/shared/*ligo.storage*
. Now the system is behaving normally without a reboot. It just has that defunct cvmfs2 process and its child, a defunct cvmfs_scitoken_helper
which I killed after finding out it was just waiting on a read from stdin.
The last log messages related to ligo.storage.igwn.org were similar to those reported for other cases:
Nov 21 18:29:54 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switched to catalog revision 12154
Nov 21 18:31:50 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:31:50 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:32:01 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host serving data too slowly)
Nov 21 18:33:32 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 2b19d47abf2fd31cc460c8790d35a914180fa251, error 17 [host data transfer cut short])
Nov 21 18:33:32 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 18:48:43 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host data transfer cut short)
Nov 21 18:48:54 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 31fdc73d412317eddb691e675c946c1cf69634e7, error 17 [host data transfer cut short])
Nov 21 18:48:55 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 18:48:56 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 18:50:56 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switched to catalog revision 12155
Nov 21 18:53:25 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 18:54:06 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 18:54:06 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 18:54:07 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 19:03:00 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 19:03:12 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://unl-cache.nationalresearchplatform.org:8443/ to https://xrootd-local.unl.edu:1094/ (host serving data too slowly)
Nov 21 19:03:12 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://xrootd-local.unl.edu:1094/ to https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ (host returned HTTP error)
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) failed to fetch /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf (hash: 20f0cf8f62a538100edd1ef36e8d217182bb6724, error 17 [host data transfer cut short])
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) EIO (05) on /igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1372446720-4096.gwf
Nov 21 19:03:13 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/ to https://dtn-pas.denv.nrp.internet2.edu:8443/ (host data transfer cut short)
Nov 21 19:06:24 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) switching host from https://dtn-pas.denv.nrp.internet2.edu:8443/ to https://unl-cache.nationalresearchplatform.org:8443/ (host serving data too slowly)
Nov 21 19:06:24 node507.cluster.ldas.cit cvmfs2[3637058]: (ligo.storage.igwn.org) Authorization for session 801203 disappeared
The LIGO CIT Condor pool now has all of its EP running with the following monit
script temporarily removed,
[root@node507 ~]# cat /etc/monit.d/cvmfs.cfg # Refresh hung CVMFS mounts /cvmfs #check program check_hung_cvmfs with path "/usr/bin/ls -d /cvmfs/singularity.opensciencegrid.org" # if status != 0 then restart # stop program = "/usr/bin/cvmfs_config wipecache" # start program = "/usr/bin/sleep 59" with timeout 60 seconds # Refresh hung CVMFS mounts /cvmfs check program check_hung_cvmfs with path "/usr/bin/cvmfs_config status" if status != 0 then restart stop program = "/usr/bin/cvmfs_config killall" start program = "/usr/bin/cvmfs_config probe"
Hung /cvmfs mounts on multiple RL8.8 system, e.g.,
which persist even after running,
Please find attached the output of
cvmfs_config bugreport
, but note at several points I need to kill hung processes, e.g., /bin/df, for the report generation to continue,cvmfs-bugreport.tar.gz