dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
288 stars 136 forks source link

Problem NFS mounting dCache #3769

Open paulmillar opened 6 years ago

paulmillar commented 6 years ago

Observed the following problem with NFS hanging during mount:

INFO: task mount.nfs:11252 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.16.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs     D ffff880028065ec0     0 11252      1 0x00000084
 ffff88015a2a7788 0000000000000286 ffffffff81fed640 0000000000000000
 ffff8801b37dd040 ffffffffa01a0bc0 ffff88015a2a76f8 ffffffffa019f498
 ffff88015a2a7718 ffffffff811b781f ffff8801fa607ad8 ffff88015a2a7fd8
Call Trace:
 [<ffffffffa01a0bc0>] ? rpc_clntdir_depopulate+0x0/0x20 [sunrpc]
 [<ffffffffa019f498>] ? rpc_destroy_inode+0x18/0x20 [sunrpc]
 [<ffffffff811b781f>] ? destroy_inode+0x2f/0x60
 [<ffffffff8154df2c>] ? _spin_unlock_irqrestore+0x1c/0x20
 [<ffffffffa01874d0>] ? xprt_wait_bit+0x0/0x20 [sunrpc]
 [<ffffffffa01874de>] xprt_wait_bit+0xe/0x20 [sunrpc]
 [<ffffffff8154baba>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811b3c2f>] ? d_free+0x3f/0x60
 [<ffffffffa01874d0>] ? xprt_wait_bit+0x0/0x20 [sunrpc]
 [<ffffffff8154bb98>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa018745e>] xprt_put+0x5e/0xd0 [sunrpc]
 [<ffffffffa0186221>] rpc_free_client+0x91/0xe0 [sunrpc]
 [<ffffffffa01862d8>] rpc_release_client+0x68/0xa0 [sunrpc]
 [<ffffffffa01866c7>] rpc_shutdown_client+0x57/0x100 [sunrpc]
 [<ffffffffa018661d>] ? rpc_ping+0x5d/0x70 [sunrpc]
 [<ffffffffa0186f9e>] rpc_create+0x46e/0x5b0 [sunrpc]
 [<ffffffff8117f91b>] ? cache_alloc_refill+0x15b/0x240
 [<ffffffffa0228ca8>] nfs_create_rpc_client+0xa8/0xf0 [nfs]
 [<ffffffffa0228e9a>] nfs4_init_client+0x6a/0x210 [nfs]
 [<ffffffffa022965b>] nfs_get_client+0x4bb/0x590 [nfs]
 [<ffffffffa02297a2>] nfs4_set_client+0x72/0xf0 [nfs]
 [<ffffffffa0190670>] ? __rpc_init_priority_wait_queue+0x80/0xb0 [sunrpc]
 [<ffffffffa022a4fb>] nfs4_create_server+0xbb/0x330 [nfs]
 [<ffffffff8118030d>] ? kmem_cache_alloc_trace+0xad/0x1c0
 [<ffffffffa0236e20>] nfs4_remote_get_sb+0x80/0x200 [nfs]
 [<ffffffff8119c81b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffffa0237205>] nfs_do_root_mount+0x95/0xe0 [nfs]
 [<ffffffffa0237672>] nfs4_try_mount+0x52/0xd0 [nfs]
 [<ffffffffa023844a>] nfs_get_sb+0x43a/0x880 [nfs]
 [<ffffffff8119c81b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8119c9c2>] do_kern_mount+0x52/0x130
 [<ffffffff811be8eb>] do_mount+0x2fb/0x930
 [<ffffffff811bc5d2>] ? copy_mount_options+0xf2/0x1a0
 [<ffffffff811befb0>] sys_mount+0x90/0xe0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

The mount is on prometheus, using a freshly installed (and freshly started) dCache, using dCache master (at 2017-12-13).

This may have been the result of an earlier dCache shutting down without unmounting (due to problems will a broken PostgreSQL upgrade).

From the stack-trace, it looks like a bad response to rpc_ping triggers this. The NFS client then tries to disconnect without that working (for whatever reason).

paulmillar commented 6 years ago

There are two mount.nfs processes, one from (presumably) when the problem first appeared (2017-12-11). However, the problem persists when dCache comes back up again.

[root@prometheus ~]# ps aux|grep nfs
root      1700  0.0  0.0  16972  1076 ?        D    12:02   0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1
root      3102  0.0  0.0 103324   844 pts/1    S+   12:08   0:00 grep nfs
root     11252  0.0  0.0  16972  1080 ?        D    Dec11   0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1
root     12642  0.0  0.0      0     0 ?        S    Dec10   0:00 [nfsv4.1-svc]
root     17448  0.0  0.0      0     0 ?        S    Nov17   0:00 [nfsiod]
dcache   29975  0.0  0.0 106120   752 ?        S    11:59   0:00 /bin/sh /usr/share/dcache/lib/daemon -f -l -c /var/run/dcache.nfsServer-java.pid -p /var/run/dcache.nfsServer-daemon.pid -r /tmp/.dcache-stop.nfsServer -d 10 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer
dcache   29979  3.2  3.1 4227148 254120 ?      Sl   11:59   0:18 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer
[root@prometheus ~]# 
[root@prometheus ~]# dcache status
DOMAIN       STATUS                   PID   USER   LOG                              
dCacheDomain running (for 12 minutes) 29918 dcache /var/log/dcache/dCacheDomain.log 
nfsServer    running (for 12 minutes) 29979 dcache /var/log/dcache/nfsServer.log    
pools        stopped (for 12 minutes)       dcache /var/log/dcache/pools.log        
[root@prometheus ~]# 
DmitryLitvintsev commented 6 years ago

This is exactly the issue we have. We can't mount nfs on the server host (can mount on other hosts). We run 2.16.

Dmitry


From: Paul Millar notifications@github.com Sent: Wednesday, December 13, 2017 5:11 AM To: dCache/dcache Cc: Subscribed Subject: Re: [dCache/dcache] Problem NFS mounting dCache (#3769)

There are two mount.nfs processes, one from (presumably) when the problem first appeared (2017-12-11). However, the problem persists when dCache comes back up again.

[root@prometheus ~]# ps aux|grep nfs root 1700 0.0 0.0 16972 1076 ? D 12:02 0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1 root 3102 0.0 0.0 103324 844 pts/1 S+ 12:08 0:00 grep nfs root 11252 0.0 0.0 16972 1080 ? D Dec11 0:00 /sbin/mount.nfs localhost:/ /dcache -o rw,vers=4.1 root 12642 0.0 0.0 0 0 ? S Dec10 0:00 [nfsv4.1-svc] root 17448 0.0 0.0 0 0 ? S Nov17 0:00 [nfsiod] dcache 29975 0.0 0.0 106120 752 ? S 11:59 0:00 /bin/sh /usr/share/dcache/lib/daemon -f -l -c /var/run/dcache.nfsServer-java.pid -p /var/run/dcache.nfsServer-daemon.pid -r /tmp/.dcache-stop.nfsServer -d 10 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer dcache 29979 3.2 3.1 4227148 254120 ? Sl 11:59 0:18 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm=DESY.DE -Djava.security.krb5.kdc=netra32.desy.de -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss-nfs.conf -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/nfsServer-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.8.10.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4001,suspend=n -Djdk.tls.ephemeralDHKeySize=matched -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start nfsServer [root@prometheus ~]#

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/dCache/dcache/issues/3769#issuecomment-351360472, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AB_qh6fUsZ27zXU7C4FJrQfEISGPxlhbks5s_7D3gaJpZM4RAX3w.

DmitryLitvintsev commented 6 years ago

here is, from our production server:

root@stkensrv1n ~]# ps auxwww | grep mount root 2708 0.0 0.0 103256 864 pts/11 S+ 09:54 0:00 grep mount root 7751 0.0 0.0 118560 14636 ? D Dec12 0:03 /sbin/mount.nfs localhost:/ /mnt -o rw,noexec,nosuid,nodev,sync,user,intr,bg,hard,vers=3 root 18108 0.0 0.0 118544 788 pts/3 D Sep20 0:00 /sbin/mount.nfs localhost:/fs /pnfs/fs -o rw,noexec,nosuid,nodev,user,noatime,vers=3,intr,bg,hard root 19263 0.0 0.0 122772 856 pts/3 D Sep20 0:00 /sbin/mount.nfs pnfs-stken:/fs /mnt -o rw,vers=3 root 31522 0.0 0.0 21196 832 ? D Sep20 0:00 /sbin/mount.nfs pnfs-stken:/eagle /pnfs/eagle -v -o rw,noexec,nosuid,nodev,sync,user,intr,bg,hard,vers=4,minorversion=1

paulmillar commented 6 years ago

Do you see the same stack-trace in the Linux kernel client, Dmitry?