dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
291 stars 136 forks source link

NFS: ls on directory hangs, dCache 7.2 #6481

Open DmitryLitvintsev opened 2 years ago

DmitryLitvintsev commented 2 years ago

dCache 7.2 client : Scientific Linux release 7.9 (Nitrogen)

[litvinse@mu2egpvm01 ~]$  ls /pnfs/mu2e/persistent/users/mu2epro/valjob/2022/02/09/summary/
^C

Hangs.

ls on parent works:

[litvinse@mu2egpvm01 ~]$ time ls  /pnfs/mu2e/persistent/users/mu2epro/valjob/2022/02/09/
ceMix  ceSimReco  cosmicSimReco  pileup  potSim  reco  stops  summary

real    0m0.167s
user    0m0.002s
sys 0m0.005s
[litvinse@mu2egpvm01 ~]$ 

Anything short of rebooting client machine ? (killing client from admin shell does not help)

Enabling NFS debug does not give any useful info (to me):

Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 30950, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 39361, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 42675, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 41715, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
paulmillar commented 2 years ago

Capturing the NFS traffic between the client and server (using a tool like tcpdump) is a good bet. This might (or might not) provide some additional insight into what's going wrong.

You could also enable debug logging on the org.dcache.nfs.v4.OperationREADDIR logger in the NFS door. The admin command should be:

log set stdout org.dcache.nfs.v4.OperationREADDIR DEBUG
DmitryLitvintsev commented 2 years ago

nfs.dump.gz

Here we go, dcpdump attached

 tcpdump -i eth0 -w /tmp/nfs.dump host mu2egpvm01
paulmillar commented 2 years ago

OK, quick summary: the client seems to be scanning through the directory response and requesting file attributes (GETATTR) for the contents of the directory. AFAIK, this is normal behaviour for an NFS client.

Each request is a compound request but that, too, is normal (AFAIK).

The responses seems OK (to my eyes), but the client keeps asking about just two directory items, cycling through them. The client asks about FileHandle-A, then FileHandle-B, then FileHandle-A, then FileHandle-B, and so on.

Ignoring the possibility of this directory having a large number of hard-links of the same two files (very unlikely), it looks like either the client is caught in a loop or dCache's directly listing went mad.

If dCache's directory listing was at fault (caught in a loop) then why would that loop terminate, unless the client has some internal limit after which it just ignores further directory items.

I think we need @kofemann's expertise to go any further here.