Closed cfgamboa closed 4 months ago
Happy New Year Carlos -- all the best for 2022!
Here are some initial observations.
The client issued an HTTP GET request. The User-Agent string was xrootd-tpc/v4.12.8, so this looks like an HTTP-TPC transfer with xrootd v4.12.8 as the destination.
The door was relaying the transfer, rather than redirecting to the pool, which makes sense given BNL's firewall policy.
The error comes from the client disconnecting while dCache was trying to send the data. This is rather an odd thing for the client to do (it asked for the data, after all). It might indicate a problem somewhere. However, perhaps dCache is too noisy in reporting what happened.
The door took just over 93 minutes to process the request. It's a guess, but this looks suspiciously like some kind of 90 minute deadline within which the transfer must complete. This deadline could have come from the xrootd server, or possibly from FTS (if FTS was driving the transfer).
Unfortunately, there doesn't seem to be enough information to learn how large the file was or how many bytes were transferred; the namespace and/or billing entries should provide this information.
So, it might be worth investigating to learn whether dCache was being particularly slow at transferring the file (and, if so, why).
Other than this, I think the problem here is that dCache is just too noisy if the client disconnects during an HTTP transfer.
Hello Paul!
Thank you for the follow-up. Indeed, below is the billing info, which seems to be consistent with your thoughts: at the time of the transfer the file was available in the pool. The log shows that the file was restored much earlier.
**File is successfully restored**
billing-2022.01.04:01.04 00:09:49 [pool:dc202_38@dc202thirtyeightDomain:restore] [00000878F923DD89448486BFC6B9843743C7,2621600220] [Unknown] bnlt1d0:BNLT1D0@osm 16826095 0 {0:""}
**Log related to the transfer**
billing-2022.01.04:01.04 11:08:43 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:2001:1458:301:27:0:0:100:76] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 5615181 0 {10011:"Error relaying data: org.eclipse.jetty.io.EofException"}
billing-2022.01.04:01.04 11:08:43 [pool:dc202_38:transfer] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 1540767744 5615173 false {Http-1.1:192.12.15.233:0:WebDAV2-dcdoor12-2:webdav2-dcdoor12_httpsDomain:/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data} [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg:1641306908778000] {666:"Transfer forcefully killed: Active transfer cancelled: door experienced error relaying data: org.eclipse.jetty.io.EofException"}
billing-error-2022.01.04:01.04 11:08:43 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:2001:1458:301:27:0:0:100:76] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 5615181 0 {10011:"Error relaying data: org.eclipse.jetty.io.EofException"}
billing-error-2022.01.04:01.04 11:08:43 [pool:dc202_38:transfer] [00000878F923DD89448486BFC6B9843743C7,2621600220] [/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] bnlt1d0:BNLT1D0@osm 1540767744 5615173 false {Http-1.1:192.12.15.233:0:WebDAV2-dcdoor12-2:webdav2-dcdoor12_httpsDomain:/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/data18_13TeV.00359717.physics_Main.daq.RAW/data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data} [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXUwotDwrg:1641306908778000] {666:"Transfer forcefully killed: Active transfer cancelled: door experienced error relaying data: org.eclipse.jetty.io.EofException"}
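In case it helps with reading these records, here is a minimal Python sketch that pulls the file size, bytes transferred and connection time out of the pool transfer record. The field order is an assumption read off the record above (pnfsid and file size in the first bracket, then path, storage class, bytes transferred, connection time in milliseconds, and a created flag). The regex and the `parse_pool_transfer` helper are illustrative only and are not part of dCache.

```python
import re

# Pool transfer record copied from the billing log above. The trailing
# protocol, initiator and error fields are left out here for brevity.
RECORD = (
    "billing-2022.01.04:01.04 11:08:43 [pool:dc202_38:transfer] "
    "[00000878F923DD89448486BFC6B9843743C7,2621600220] "
    "[/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/RAW/other/"
    "data18_13TeV.00359717.physics_Main.daq.RAW/"
    "data18_13TeV.00359717.physics_Main.daq.RAW._lb0461._SFO-6._0001.data] "
    "bnlt1d0:BNLT1D0@osm 1540767744 5615173 false"
)

# Assumed layout of a pool "transfer" record, inferred from the line above.
PATTERN = re.compile(
    r"\[pool:[^\]]*:transfer\] "
    r"\[(?P<pnfsid>[0-9A-F]+),(?P<size>\d+)\] "
    r"\[(?P<path>[^\]]*)\] "
    r"(?P<storage>\S+) "
    r"(?P<transferred>\d+) (?P<connection_ms>\d+) (?P<created>true|false)"
)


def parse_pool_transfer(line):
    """Return the numeric fields of a pool transfer record, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "pnfsid": match.group("pnfsid"),
        "size": int(match.group("size")),
        "transferred": int(match.group("transferred")),
        "connection_ms": int(match.group("connection_ms")),
    }


if __name__ == "__main__":
    print(parse_pool_transfer(RECORD))
```

Lines that are not pool transfer records (for example the door request entries) simply return `None`.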
Hi Carlos,
Thanks for the billing information. I've extracted some relevant information here:
File size: 2621600220 (~2.4 GiB)
Transferred data: 1540767744 (~1.4 GiB, 58.8% of the file)
Connection time: 93 minutes
... from which it's easy to calculate:
Average bandwidth: 268 KiB/s
Expected transfer time: 2.6 hours
Assuming a 90-minute timeout ...
Required (avg) bandwidth: 475 KiB/s
So, the transfer was simply progressing too slowly and xrootd or FTS timed out and cancelled the transfer by disconnecting.
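For anyone who wants to redo the arithmetic, here is a rough Python sketch. The byte counts and the connection time (in milliseconds) are taken from the pool billing record. The 90-minute timeout is only the guess discussed above, and the `summarize` helper is a made-up name for illustration. The results should land close to the rounded figures quoted above.

```python
KIB = 1024

# Values taken from the pool billing record in this thread.
FILE_SIZE = 2621600220      # bytes
TRANSFERRED = 1540767744    # bytes sent before the client disconnected
CONNECTION_MS = 5615173     # pool-side connection time, milliseconds

# Assumption: the client-side deadline is 90 minutes (a guess, not confirmed).
ASSUMED_TIMEOUT_S = 90 * 60


def summarize(file_size, transferred, connection_ms, timeout_s):
    """Derive average rate, projected duration and the rate needed to finish in time."""
    elapsed_s = connection_ms / 1000.0
    avg_bytes_per_s = transferred / elapsed_s
    return {
        "fraction_done": transferred / file_size,
        "avg_kib_per_s": avg_bytes_per_s / KIB,
        # Time the whole file would need at the observed average rate.
        "expected_hours": file_size / avg_bytes_per_s / 3600.0,
        # Average rate needed to move the whole file within the timeout.
        "required_kib_per_s": file_size / timeout_s / KIB,
    }


if __name__ == "__main__":
    for name, value in summarize(
        FILE_SIZE, TRANSFERRED, CONNECTION_MS, ASSUMED_TIMEOUT_S
    ).items():
        print(f"{name}: {value:.3f}")
```

The observed average rate is a bit more than half of what a 90-minute deadline would require, which fits with the transfer being cut off at roughly 59% of the file.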
This raises the question: why was the transfer so slow?
It could be something on the dCache side (e.g., the disk being overloaded), a problem with the networks somewhere, or a problem at the remote site.
Unfortunately, dCache currently doesn't record enough information to answer this.
For GridFTP transfers, we log detailed ("forensic", perhaps) information if the client aborts a transfer. Unfortunately, this is currently not available for other transfer protocols (it's on my TODO list).
First of all, Happy new year!
====
dCache door on 7.2.7, pools 7.2.3
The following request is failing with the same type of error on the transfer elements that are part of the transfer:
org.eclipse.jetty.io.EofException:
Below is the extract of the log for transfer ID AAXUwotDwrg:
DAV door
Pool
Please advise if this is something already reported or if more information is needed.
Carlos