dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

java.lang.NullPointerException: null error on WebDAV Domain due to jetty 9.4.51 bug #7631

Closed geonmo closed 2 months ago

geonmo commented 4 months ago

Dear dCache staff,

I found some errors on a WebDAV domain.

2024-08-04T00:36:58.908509+09:00 cms-t2-se01.sdfarm.kr dcache@WebDavDomain[488477]: java.lang.NullPointerException: null
2024-08-04T00:36:58.908509+09:00 cms-t2-se01.sdfarm.kr dcache@WebDavDomain[488477]: #011at org.eclipse.jetty.io.AbstractConnection.lambda$failedCallback$0(AbstractConnection.java:92)
2024-08-04T00:36:58.908509+09:00 cms-t2-se01.sdfarm.kr dcache@WebDavDomain[488477]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
2024-08-04T00:36:58.908509+09:00 cms-t2-se01.sdfarm.kr dcache@WebDavDomain[488477]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
2024-08-04T00:36:58.908509+09:00 cms-t2-se01.sdfarm.kr dcache@WebDavDomain[488477]: #011at java.base/java.lang.Thread.run(Thread.java:829)

I don't know if it's due to this bug, but if the number of failed transfers at the WebDAV door increases, at some point the WebDAV domain will stop working and even basic functions like serving SSL certificates won't work.
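One way to check whether the trace correlates with the hangs would be to count how often it appears per hour in the syslog output. The following is only a minimal sketch, assuming log lines in the same format as the excerpt above; the function name and the idea of bucketing by hour are illustrative and not part of dCache or jetty.

```python
import re
import sys
from collections import Counter

# The jetty frame seen in the trace above. Matching on this frame (rather than
# on the generic "NullPointerException" line) avoids counting unrelated NPEs.
FRAME = "org.eclipse.jetty.io.AbstractConnection.lambda$failedCallback$0"

# ISO 8601 timestamp at the start of each syslog line; group 1 is the
# date plus hour, used as the bucket key.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}")


def count_failed_callback_npes(lines):
    """Count occurrences of the failedCallback frame per hour bucket."""
    per_hour = Counter()
    for line in lines:
        if FRAME not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            per_hour[match.group(1)] += 1
    return per_hour


if __name__ == "__main__":
    # Reads log lines from standard input and prints one line per hour bucket.
    for hour, count in sorted(count_failed_callback_npes(sys.stdin).items()):
        print(hour, count)
```

Feeding the WebDAV domain's log through standard input would print one count per hour, which can then be compared with the times at which the domain stopped responding.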

I googled the error and found that it occurred in jetty 9.4.51 and was fixed in 9.4.52. (https://github.com/jetty/jetty.project/issues/9476)

However, I understand that dCache v9.2 uses jetty 9.4.51.
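For anyone checking their own installation, the comparison against the fixed release can be sketched as below. This is a rough illustration only: the helper names are hypothetical, the fixed version 9.4.52 is taken from the jetty issue linked above, and the check only says whether a version string is older than that release, not whether the bug is actually present in it.

```python
import re

# Per the jetty issue referenced above, the fix is in 9.4.52.
FIXED_IN = (9, 4, 52)


def parse_version(version):
    """Extract the leading major.minor.patch numbers from a version string.

    Any trailing qualifier after the three numbers (for example a build
    suffix) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def predates_fix(version):
    """Return True if the given version is older than the fixed release."""
    return parse_version(version) < FIXED_IN
```

With this sketch, `predates_fix("9.4.51")` is true and `predates_fix("9.4.52")` is false. How to find out which jetty version a given dCache installation bundles is not covered here.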

I strongly request that the jetty version be upgraded in the next release of dCache v9.2.

Please also check whether this bug can cause any abnormal behavior of a WebDAV domain.

The webdav hang error has significantly reduced the availability of our site.

Regards,

-- Geonmo

kofemann commented 3 months ago

Hi @geonmo ,

Thanks for reporting. Do you know how to reproduce the issue?

geonmo commented 3 months ago

Hello, @kofemann ,

The code to test jetty itself is in jetty issue #9476. (https://github.com/jetty/jetty.project/issues/9476)

In my case, a recent slowdown in data uploads brought the issue on.

Currently on our site, roughly 450 uploads are consistently failing with timeouts, and the WebDAV domain goes down at intervals of between 1 and 8 hours. I assume the number of failures is not higher because the timeout (7200) is long.

I think it might be possible to reproduce this by making the timeout of the WebDAV transfer task smaller, so that transfers keep failing, and then testing.

geonmo commented 2 months ago

I've seen the release with the jetty version change, thank you for taking care of it. I'll close this issue.