Open paulmillar opened 6 years ago
There are two mount.nfs processes, one from (presumably) when the problem first appeared (2017-12-11). However, the problem persists when dCache comes back up again.
[root@prometheus ~]# ps aux|grep nfs
root 1700 0.0 0.0 16972 1076 ? D 12:02 0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1
root 3102 0.0 0.0 103324 844 pts/1 S+ 12:08 0:00 grep nfs
root 11252 0.0 0.0 16972 1080 ? D Dec11 0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1
root 12642 0.0 0.0 0 0 ? S Dec10 0:00 [nfsv4.1-svc]
root 17448 0.0 0.0 0 0 ? S Nov17 0:00 [nfsiod]
dcache 29975 0.0 0.0 106120 752 ? S 11:59 0:00 /bin/sh /usr/share/dcache/lib/daemon -f -l -c /var/run/dcache.nfsServer-java.pid -p /var/run/dcache.nfsServer-daemon.pid -r /tmp/.dcache-stop.nfsServer -d 10 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer
dcache 29979 3.2 3.1 4227148 254120 ? Sl 11:59 0:18 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer
[root@prometheus ~]#
[root@prometheus ~]# dcache status
DOMAIN STATUS PID USER LOG
dCacheDomain running (for 12 minutes) 29918 dcache /var/log/dcache/dCacheDomain.log
nfsServer running (for 12 minutes) 29979 dcache /var/log/dcache/nfsServer.log
pools stopped (for 12 minutes) dcache /var/log/dcache/pools.log
[root@prometheus ~]#
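For anyone who wants to check another host for the same symptom, here is a minimal Python sketch that picks the mount.nfs processes in state D (uninterruptible sleep) out of `ps aux` output. The function name `stuck_mounts` is made up for illustration, and the sketch assumes the usual 11-column `ps aux` layout shown above.

```python
def stuck_mounts(ps_output):
    """Return (pid, start, command) for each mount.nfs process in state D."""
    found = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one (the command) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line, shell prompts and short lines
        pid, stat, start, command = fields[1], fields[7], fields[8], fields[10]
        if stat.startswith("D") and "mount.nfs" in command:
            found.append((int(pid), start, command))
    return found
```

Feeding it the text of the listing above (or the output of `ps aux` captured with `subprocess.run`) should report PIDs 1700 and 11252; the START column shows which of the two predates the dCache restart.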
This is exactly the issue we have. We can't mount NFS on the server host itself (mounting from other hosts works). We run 2.16.
Dmitry
From: Paul Millar notifications@github.com Sent: Wednesday, December 13, 2017 5:11 AM To: dCache/dcache Cc: Subscribed Subject: Re: [dCache/dcache] Problem NFS mounting dCache (#3769)
Here is the output from our production server:
[root@stkensrv1n ~]# ps auxwww | grep mount
root 2708 0.0 0.0 103256 864 pts/11 S+ 09:54 0:00 grep mount
root 7751 0.0 0.0 118560 14636 ? D Dec12 0:03 /sbin/mount.nfs localhost:/ /mnt -o rw,noexec,nosuid,nodev,sync,user,intr,bg,hard,vers=3
root 18108 0.0 0.0 118544 788 pts/3 D Sep20 0:00 /sbin/mount.nfs localhost:/fs /pnfs/fs -o rw,noexec,nosuid,nodev,user,noatime,vers=3,intr,bg,hard
root 19263 0.0 0.0 122772 856 pts/3 D Sep20 0:00 /sbin/mount.nfs pnfs-stken:/fs /mnt -o rw,vers=3
root 31522 0.0 0.0 21196 832 ? D Sep20 0:00 /sbin/mount.nfs pnfs-stken:/eagle /pnfs/eagle -v -o rw,noexec,nosuid,nodev,sync,user,intr,bg,hard,vers=4,minorversion=1
Do you see the same stack-trace in the Linux kernel client, Dmitry?
Observed the following problem with NFS hanging during mount:
The mount is on prometheus, using a freshly installed (and freshly started) dCache, using dCache master (at 2017-12-13).
This may have been the result of an earlier dCache instance shutting down without unmounting (due to problems with a broken PostgreSQL upgrade).
From the stack-trace, it looks like a bad response to rpc_ping triggers this. The NFS client then tries to disconnect, but the disconnect does not complete (for whatever reason).
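To see how the server answers a ping independently of the kernel client, one option is to send the same kind of request from user space. As far as I know, rpc_ping in the Linux SUNRPC client issues the RPC NULL procedure (procedure 0). The Python sketch below builds such a call with AUTH_NONE and checks the reply. The helper names, port 2049, program version 4 and the single-fragment reply are all assumptions for illustration, not taken from the dCache configuration.

```python
import socket
import struct

NFS_PROGRAM = 100003          # ONC RPC program number assigned to NFS
CALL, REPLY = 0, 1            # RPC message types
MSG_ACCEPTED, SUCCESS = 0, 0  # reply_stat and accept_stat values for "ok"


def build_null_call(xid, version, program=NFS_PROGRAM):
    """Return a record-marked RPC CALL for procedure 0 (NULL) with AUTH_NONE."""
    body = struct.pack(
        ">10I",
        xid, CALL, 2,         # xid, message type, RPC protocol version 2
        program, version, 0,  # program, program version, procedure 0 (NULL)
        0, 0,                 # credential: AUTH_NONE with an empty body
        0, 0,                 # verifier: AUTH_NONE with an empty body
    )
    # TCP record marking: the high bit flags the last fragment, the rest is the length.
    return struct.pack(">I", 0x80000000 | len(body)) + body


def null_reply_ok(reply, xid):
    """Check that `reply` (record mark already stripped) accepts the NULL call."""
    if len(reply) < 24:
        return False
    rxid, mtype, reply_stat, _flavor, verf_len = struct.unpack(">5I", reply[:20])
    if (rxid, mtype, reply_stat) != (xid, REPLY, MSG_ACCEPTED):
        return False
    offset = 20 + ((verf_len + 3) & ~3)  # skip the verifier body, padded to 4 bytes
    if len(reply) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", reply[offset:offset + 4])
    return accept_stat == SUCCESS


def _read_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes the connection early."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("server closed the connection")
        data += chunk
    return data


def ping(host="localhost", port=2049, version=4, timeout=5.0, xid=1):
    """Send one NULL call over TCP; assumes the reply fits in a single fragment."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_null_call(xid, version))
        (mark,) = struct.unpack(">I", _read_exact(sock, 4))
        return null_reply_ok(_read_exact(sock, mark & 0x7FFFFFFF), xid)
```

Calling `ping()` on the server host would then show whether the NULL call is answered, rejected, or times out. A timeout or a malformed reply here would be consistent with the kernel client getting stuck, although it would not by itself explain why the disconnect fails.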