Open DmitryLitvintsev opened 2 years ago
Result of `tcpdump host dunegpvm09`, executed while the copy was 'running', is attached. Not much there.
tcp.dump.gz
On a pool node I see a connection to the client host:
tcp6 0 400 stkendca1909.fnal:29938 dunegpvm09.fnal.gov:768 ESTABLISHED 22610/java
22610 is the PID of the pool process. The pool has no movers associated with it.
On the client I see the connection to the pool:
tcp 0 612 131.225.67.242:768 131.225.69.112:29938 ESTABLISHED -
A lot of data is pumped on that connection:
dunegpvm09.fnal.gov => stkendca1909.fnal.gov 28.2Mb 26.9Mb 26.9Mb
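The two socket lines above appear to describe the same TCP connection seen from each end. A minimal sketch of checking that mechanically, assuming the lines follow the usual `netstat -tp` column order (protocol, Recv-Q, Send-Q, local address, foreign address, state, owner); `parse_line` and `ports_mirror` are hypothetical helper names, not part of dCache or of any tool used in this thread:

```python
from typing import NamedTuple, Optional


class Conn(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    local_port: int
    remote: str
    remote_port: int
    state: str


def parse_line(line: str) -> Optional[Conn]:
    """Parse one netstat-style TCP line; return None if it does not fit."""
    fields = line.split()
    if len(fields) < 6 or not fields[0].startswith("tcp"):
        return None
    try:
        # Split on the last colon so IPv6 literals keep their own colons.
        local, local_port = fields[3].rsplit(":", 1)
        remote, remote_port = fields[4].rsplit(":", 1)
        return Conn(fields[0], int(fields[1]), int(fields[2]),
                    local, int(local_port), remote, int(remote_port), fields[5])
    except ValueError:
        # Header lines, wildcard ports ("*"), or malformed input.
        return None


def ports_mirror(a: Conn, b: Conn) -> bool:
    """True if the two entries have swapped local/remote ports."""
    return (a.local_port, a.remote_port) == (b.remote_port, b.local_port)


# The two lines quoted in this report (pool side, then client side).
pool_side = parse_line(
    "tcp6 0 400 stkendca1909.fnal:29938 dunegpvm09.fnal.gov:768 ESTABLISHED 22610/java")
client_side = parse_line(
    "tcp 0 612 131.225.67.242:768 131.225.69.112:29938 ESTABLISHED -")
```

Only ports are compared, because one side prints host names and the other prints IP addresses. Under the assumed column layout, the third field is the send queue, so both ends have unacknowledged bytes queued (400 and 612) on this connection.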
The tcpdump produced by this command on the client:
tcpdump -w tcp.dump host stkendca1909
is attached.
Restarting the client (or the pool) apparently kills that connection, and the file can be copied again!
This issue is affecting us badly.
According to the dump, the pool returns BAD_STATEID, which in dCache language means 'no such mover'. Above, you mentioned that no transfer or mover was started (that explains BAD_STATEID). The question is why the client thinks that it can read the file without issuing LAYOUTGET :man_shrugging:.
Please ensure that the export file contains the option `lt=flex_files`. At least this will activate the kernel module that has better support.
dunegpvm09(rw,lt=nfsv4_1_files:flex_files)
is in /etc/exports
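A quick way to confirm which layout types an export entry offers is to parse the `lt=` option. The sketch below handles only the `host(options)` fragment quoted above and assumes the layout types are colon-separated, as in that fragment; `layout_types` is a hypothetical helper, not a dCache API:

```python
import re
from typing import List


def layout_types(client_spec: str) -> List[str]:
    """Return the layout types listed in the lt= option of host(options)."""
    match = re.fullmatch(r"\s*([^\s()]+)\(([^()]*)\)\s*", client_spec)
    if match is None:
        raise ValueError(f"not a host(options) fragment: {client_spec!r}")
    for option in match.group(2).split(","):
        key, sep, value = option.strip().partition("=")
        if key == "lt" and sep:
            return value.split(":")
    return []  # no lt= option present


entry = "dunegpvm09(rw,lt=nfsv4_1_files:flex_files)"
has_flex_files = "flex_files" in layout_types(entry)
```

For the entry in this report, `has_flex_files` is true, so the option asked for above is already set for this client.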
We do not know which is the cause and which is the effect here.
An attempt to run `cp` on a file ends up with no interaction with the pool (or, it looks like, with the door).
This seems to be caused by this pre-existing connection:
tcp6 0 400 stkendca1909.fnal:29938 dunegpvm09.fnal.gov:768 ESTABLISHED 22610/java
How this connection was created is unknown. Just one thing of note: port 768 sounds familiar. I have seen the same port used in the past in similar circumstances.
There is no transfer or mover associated with this connection. Yet it is there. Left behind from some previous interaction?
Does the NFS door's access log file provide any useful information on the client interaction?
This should contain (almost) all client interactions that led up to this problem.
In 7.2:
on one host:
saw mover on pool :+1:
from two other hosts:
just stays there, no data movement
No mover on pool. No transfers: