dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
291 stars 136 forks source link

Error relaying data: org.eclipse.jetty.io.EofException #6389

Closed cfgamboa closed 4 months ago

cfgamboa commented 2 years ago

First of all, Happy new year!

====

dCache door on 7.2.7, pools 7.2.3

The following request is failing with same type of error on the transfer elements part of the transfer. org.eclipse.jetty.io.EofException:

Below the extract of the log for AAXUwotDwrg id,

DAV door

webdav2-dcdoor12_httpsDomain.access:level=ERROR ts=2022-01-04T11:08:43.986-0500 event=org.dcache.webdav.request request.method=GET request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data response.code=500 socket.remote=[2001:1458:301:27::100:76]:45624 user-agent=xrootd-tpc/v4.12.8 user.mapped=6435:31152,69907[fX8Blbv6:619bx4/E] transaction=door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg:1641306908778000 duration=5615214
webdav2-dcdoor12_httpsDomain.log:04 Jan 2022 11:08:43 (WebDAV2-dcdoor12-2) [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg] Internal server error: org.dcache.webdav.WebDavException: Failed to send entity: client closed connection
webdav2-dcdoor12_httpsDomain.log:04 Jan 2022 11:08:43 (WebDAV2-dcdoor12-2) [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg] Failed to write error response Internal problem: Failed to send entity: client closed connection: org.eclipse.jetty.io.EofException: Closed

Pool

04 Jan 2022 11:08:43 [dc202_38-11295] [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg WebDAV2-dcdoor12-2 PoolMoverKill] Killing mover 117470757: door experienced error relaying data: org.eclipse.jetty.io.EofException
04 Jan 2022 11:08:43 [http-net-3] [] Connection [id: 0x05a92d6a, L:0.0.0.0/0.0.0.0:22560 ! R:/10.42.38.65:44028] unexpectedly closed.
04 Jan 2022 11:08:43 [dc202_38-11295] [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg WebDAV2-dcdoor12-2 PoolDeliverFile 00000878F923DD89448486BFC6B9843743C7] Transfer forcefully killed: Active transfer cancelled: door experienced error relaying data: org.eclipse.jetty.io.EofException

Please advise if this is something already reported or if more information is needed.

Carlos

paulmillar commented 2 years ago

Happy New Year Carlos -- all the best for 2022!

Here are some initial observations.

The client issued an HTTP GET request. The User-Agent string was xrootd-tpc/v4.12.8, so this looks like an HTTP-TPC transfer with an xrootd v4.12.8 as the destination.

The door was relaying the transfer, rather than redirecting to the pool, which makes sense given BNL's firewall policy.

The error comes from the client disconnected while dCache was trying to send the data. This is rather an odd thing for the client to do (it asked for the data, after all). It might indicate a problem somewhere. However, perhaps dCache is too noisy in reporting what happened.

The door took just over 93 minutes to process the request. It's a guess, but this looks suspiciously like some kind of 90 minute deadline within which the transfer must complete. This deadline could have come from the xrootd server, or possibly from FTS (if FTS was driving the transfer).

Unfortunately, there doesn't seem to be enough information to learn how large was the file or how many bytes were transferred: the namespace and/or billing entries should provide this information.

So, it might be worth investigating to learn whether dCache was being particularly slow at transferring the file (and, if so, why).

Other that this, I think the problem here is dCache is just too noisy if the client disconnects during an HTTP transfer.

cfgamboa commented 2 years ago

Hello Paul!

Thank you for the followup. Indeed, bellow the billing info, which seems to be consistent to your thoughts; at the transfer time the file was available in the pool. The log showed that the file was restored way earlier.

**File is successfully restored** 
billing-2022.01.04:01.04 00:09:49 [pool:dc202_38@dc202thirtyeightDomain:restore] [00000878F923DD89448486BFC6B9843743C7,2621600220] [Unknown] bnlt1d0:BNLT1D0@osm 16826095 0 {0:""}

**Log related to the transfer**
billing-2022.01.04:01.04 11:08:43 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:2001:1458:301:27:0:0:100:76] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 5615181 0 {10011:"Error relaying data: org.eclipse.jetty.io.EofException"}
billing-2022.01.04:01.04 11:08:43 [pool:dc202_38:transfer] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 1540767744 5615173 false {Http-1.1:192.12.15.233:0:WebDAV2-dcdoor12-2:webdav2-dcdoor12_httpsDomain:/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data} [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg:1641306908778000] {666:"Transfer forcefully killed: Active transfer cancelled: door experienced error relaying data: org.eclipse.jetty.io.EofException"}
billing-error-2022.01.04:01.04 11:08:43 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:2001:1458:301:27:0:0:100:76] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 5615181 0 {10011:"Error relaying data: org.eclipse.jetty.io.EofException"}
billing-error-2022.01.04:01.04 11:08:43 [pool:dc202_38:transfer] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 1540767744 5615173 false {Http-1.1:192.12.15.233:0:WebDAV2-dcdoor12-2:webdav2-dcdoor12_httpsDomain:/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data} [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg:1641306908778000] {666:"Transfer forcefully killed: Active transfer cancelled: door experienced error relaying data: org.eclipse.jetty.io.EofException"}
paulmillar commented 2 years ago

Hi Carlos,

Thanks for the billing information. I've extracted some relevant information here:

File size: 2621600220 (~2.4 GiB) Transferred data: 1540767744 (~1.4 GiB, 58.8% of the file) Connection time: 93 minutes

... from which it's easy to calculate:

Average bandwidth: 268 KiB/s Expected transfer time: 2.6 hours

Assuming a 90-minute timeout ...

Required (avr) bandwidth: 475 KiB/s

So, the transfer was simply progressing too slowly and xrootd or FTS timed out and cancelled the transfer by disconnecting.

This raises the question: why was the transfer so slow?

It could be something on the dCache size (e.g., the disk being overloaded), a problem with the networks somewhere, or a problem at the remote site.

Unfortunately, this is something where there's currently not enough information in dCache.

For GridFTP transfers, we log detailed ("forensic", perhaps) information if the client aborts a transfer. Unfortunately, this is currently not available for other transfer protocols (it's on my TODO list).