dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
282 stars 135 forks source link

Uncaught exception in GFTP #6205

Open DmitryLitvintsev opened 2 years ago

DmitryLitvintsev commented 2 years ago

After upgrade to 7.2 the following exception is seen. I do not believe I have seen it before.

01 Nov 2021 14:37:05 (GFTP0-fndca4b-AAXPv01F4lA) [door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain] Uncaught exception in thread GFTP0-fndca4b-io-51
io.netty.handler.codec.TooLongFrameException: frame length (84173) exceeds the allowed maximum (65536)
        at io.netty.handler.codec.LineBasedFrameDecoder.fail(LineBasedFrameDecoder.java:158)
        at io.netty.handler.codec.LineBasedFrameDecoder.fail(LineBasedFrameDecoder.java:154)
        at io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:141)
        at io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:84)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:50)
        at java.base/java.lang.Thread.run(Thread.java:829)

It is always the same frame length : 84173 And it looks to be ~ 200 / day frequency.

alrossi commented 2 years ago
pipeline.addBefore(self, "frameDecoder", new LineBasedFrameDecoder(KiB.toBytes(64)));

in the NettyLineBasedDoor. Where does this max size come from? The question of course is why are we getting the oversized frame. That is, what would not be observing the 64K limit?

DmitryLitvintsev commented 2 years ago

access file for this interaction:

level=INFO ts=2021-11-01T14:37:05.685-0500 event=org.dcache.ftp.hello session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain socket.remote=131.225.67.13:43624 socket.local=131.225.69.121:2811 reply="220 GSI FTP door ready"
level=INFO ts=2021-11-01T14:37:05.687-0500 event=org.dcache.ftp.response session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain command="AUTH GSSAPI" reply="334 ADAT must follow"
level=INFO ts=2021-11-01T14:37:05.702-0500 event=org.dcache.ftp.response session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-11-01T14:37:05.709-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:43624 session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain
paulmillar commented 2 years ago

Some ideas/questions:

  1. Are these problems coming from that same remote host (or a limited set of remote hosts) ?
  2. Assuming yes, are all transfers from that host (or "those hosts") always failing?
  3. Can we capture the network traffic for one of these failures, to see what the client is sending?
  4. Can we find the user responsible for these failures and ask them what they're doing?
DmitryLitvintsev commented 2 years ago

Excellent suggestions, Paul, as always. The Issue has been created as a place holder to go back to if team has no immediate ideas what that might be. It is bloodbath here at Fermilab since upgrade. Once I have a little more time to dedicate to this issue all these avenues will be explored.

DmitryLitvintsev commented 2 years ago
 doors=`zgrep -B1 'TooLongFrameException' gridftp0-fndca4bDomain.log-20211102.gz  | grep GFTP  | awk '{ print $5}' | sed -e 's/[\(\)]//g'`

[root@fndca4b dcache-log]# for d in $doors ; do  zgrep "$d" gridftp0-fndca4bDomain.access.2021-11-01.gz | grep "host"; done
level=INFO ts=2021-11-01T21:00:05.423-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47088 session=door:GFTP0-fndca4b-AAXPxKb4KDA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:23.555-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47090 session=door:GFTP0-fndca4b-AAXPxKgNGqA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:24.592-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47092 session=door:GFTP0-fndca4b-AAXPxKgc5Zg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:24.625-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47094 session=door:GFTP0-fndca4b-AAXPxKgdamg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:37.765-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47100 session=door:GFTP0-fndca4b-AAXPxKjl4rg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:02:25.089-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47114 session=door:GFTP0-fndca4b-AAXPxK9LgTA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:56.306-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47116 session=door:GFTP0-fndca4b-AAXPxLS7UeA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:57.339-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47118 session=door:GFTP0-fndca4b-AAXPxLTLNEg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:57.375-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47120 session=door:GFTP0-fndca4b-AAXPxLTLsUg@gridftp0-fndca4bDomain

same host.

There are successful transfers :

billing=# select sum(case when errorcode <> 0 then 1 else 0 end) as errors, sum(case when errorcode <> 0 then 0 else 1 end) as success, protocol from billinginfo where client='131.225.67.13' and datestamp > current_date - interval '2 days'  group by protocol;
 errors | success |  protocol  
--------+---------+------------
      0 |     537 | NFS4-4.1
      0 |     991 | Xrootd-5.0
(2 rows)

This seems like interactive node. I will ask user what are they doing.