Open chaen opened 4 months ago
Hi @chaen,
If I may suggest, please urgently ask Adrien to keep this xrootd door's access log file on 2024-07-06, from 08:52:38.584+02 to 09:30:16.004+02.
The access log files are often rotated (and deleted) relatively quickly, so being prompt with this request would be wise.
Cheers, Paul.
Hi,
I checked the xrootd access logs (in attachment) but I couldn't find anything relevant except a "no such file" message.
I think the default log level is not verbose enough to get details about those requests.
xrootd-ccdcacli434Domain.access.2024-07-06.gz
On the dCache side, the xrootd queues on the pools were full with active requests and hundreds of xrootd transfers queued.
I don't know how the transfer timeout is managed on the client side, but I saw a lot of "Transfer interrupted"
and "No connection from client after 300 seconds. Giving up."
errors.
Adrien
The worrying part is, as @ageorget says, there's no mention in the xrootd door's access log file of either the failed upload attempt (at ~08:52:38) or the problematic delete (at ~09:30:16).
The filename 00220520_00166463_1.qee.dst only appears as 13 failed stat requests, from 09:40:33 to 09:56:18.
The access log file is recording at INFO level (which is good), so it should capture all successful and failed client requests. I therefore do not understand why billing reports door activity for which the door itself shows no corresponding client activity.
Does the domain log file (for the domain hosting the xrootd door) indicate any errors/problems at around this time?
In the XRootD door log file, I only found some SSL handshake timeouts, but not at the exact timestamps of the upload or delete requests:
Jul 06 09:04:14 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 09:04:14 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [door:xrootd-ccdcacli434@xrootd-ccdcacli434Domain:AAYcjpu3+Pg] TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Another suspicious activity is this, on many files: the file is first read successfully with webdav and then fails 17 times with xrootd at the same time, with the "No connection from client after 300 seconds. Giving up." error.
"No connection from client after 300 s" usually means that the client did not connect to the mover within 5 minutes.
For "Transfer interrupted", look for log entries on that pool corresponding to that transfer.
A transfer is identified by its session id string, looking like "door:Xrootd01-fndcadoor01@xrootd01-fndcadoor01Domain:AAYc3hPdkMg" (for instance).
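For example (only a sketch: the log path and domain name are assumptions, adjust them to wherever the pool domain logs live at your site):

grep 'AAYc3hPdkMg' /var/log/dcache/<poolDomain>.log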
To me it looks like the client issues a remove command, TBH. It would be great to have the client-side log (of the failure of course; I would like to see if they log the removal).
That output in table form from @ageorget looks like a query of the billing DB; can it include initiator and transaction? Essentially just do "select *". Maybe open an RT ticket and upload it there (together with the access file).
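For example, something like this (only a sketch: the table and column names below are assumptions based on a generic dCache billing schema, and may need adjusting to the actual billing DB at IN2P3):

-- pull every column for the file in question around the time of the incident
SELECT *
FROM billinginfo
WHERE path LIKE '%00220520_00166463_1.qee.dst'
  AND datestamp BETWEEN '2024-07-06 08:50:00+02' AND '2024-07-06 09:35:00+02'
ORDER BY datestamp;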
Another suspicious activity is this, on many files: the file is first read successfully with webdav and then fails 17 times with xrootd at the same time, with the "No connection from client after 300 seconds. Giving up." error.
Regarding the read pattern, there is a job running on that file which tries to download the file locally
2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData INFO: Preliminary checks OK : now downloading /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER
2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData VERBOSE: Attempting to download file /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER:
Trying to download root://proxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
1 records inserted to MonitoringDB
Failure in plugin to perform getFile Plugin: GFAL2_XROOT lfn: /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst error GError('Error on XrdCl::CopyProcess::Run(): [FATAL] Redirect limit has been reached: (source)', 40)
Trying to download https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
2024-07-10 14:29:03 UTC None/[904754301]DownloadInputData VERBOSE: File successfully downloaded locally (/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/
It fails with xrootd, and then succeeds with https. The timestamp matches the http read. What is interesting is that the failed xroot logs on the dCache side are timestamped exactly 300 seconds later, so indeed just as if things had kept going, while the client had definitely received an error and moved on.
To me it looks like the client issues a remove command, TBH. It would be great to have the client-side log (of the failure of course; I would like to see if they log the removal).
I am afraid it is not logged at the standard level. Running in debug isn't an option either, as it's way too verbose :grimacing:
As a practical step I would consider changing
pool.mover.xrootd.timeout.connect = 300
to something like
pool.mover.xrootd.timeout.connect = 5
This needs to be done on the pools. I would try one.
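A minimal sketch of where that would go, assuming the setting is made in /etc/dcache/dcache.conf on the pool node (it could equally go in the pool section of the layout file); the pool domain needs a restart afterwards:

# /etc/dcache/dcache.conf on the pool node
# wait at most 5 s (instead of 300 s) for the client to connect to the mover
pool.mover.xrootd.timeout.connect = 5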
That output in table form from @ageorget looks like a query of the billing DB; can it include initiator and transaction? Essentially just do "select *". Maybe open an RT ticket and upload it there (together with the access file).
I just opened an RT ticket with additional logs (xrootd/pool domain/access logs).
As a practical step I would consider changing
pool.mover.xrootd.timeout.connect = 300
to something like
pool.mover.xrootd.timeout.connect = 5
This needs to be done on the pools. I would try one.
Ok I did it on one pool and restarted it. Should I check something in particular then?
Checking the movers on the pool, I see all of these duplicated (x17) xrootd movers transferring nothing (bytes=0) until they reach the timeout: moverinfo_lhcb.txt
I think these are reads in a similar situation to your writes - the mover started but the client hasn't connected.
And the door did not get any control command from the client, or did not understand what the client communicated.
What are JTM settings on pools?
(just attach info -a from a pool, I will figure it out)
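For example, something along these lines (a sketch reusing the admin-door ssh syntax used elsewhere in this thread; host, key and pool name are placeholders):

ssh -q -i ~/.ssh/id_dcache -p 22224 ccdcamcli06 "\s <poolName> info -a" > poolinfo.txt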
NB: I am currently looking for a palliative "fix"; ultimately I need to understand why these failures happened in the first place (I refer to WRITEs).
Returning to the original issue - in principle, if dCache thinks it was an incomplete transfer, it tries to remove the file; it just does so after a long timeout.
Ok I did it on one pool and restarted it. Should I check something in particular then?
Just one thing - do you see more errors on that pool, or is the number of errors the same as before (errors like "No connection from client after")?
If you do not see an increase in these, then I would use this setting on all your pools.
And if this helps with the file deletion, we have breathing room to address the root cause, or at least understand it better. My current thinking is that dCache and the client fail to agree on whether the transfer failed or not: the client thinks it has failed and has bailed out, whereas dCache thinks that the transfer is "in progress" and keeps waiting for the client to connect to the pool. Shortening that wait time from 5 minutes to 5 seconds might alleviate your problem, but ultimately we need to understand if we are missing something and not shutting down the mover properly on the client/door interaction. At least this is my conjecture for now.
Returning to the original issue - in principle, if dCache thinks it was an incomplete transfer, it tries to remove the file; it just does so after a long timeout.
The xrootd door should only remove a failed upload if the door knows the file's PNFS-ID, which it obtained when the transfer was started. It also includes this PNFS-ID when sending a message to the namespace requesting that the entry be deleted.
The namespace should only remove the file if it exists and has the expected PNFS-ID. If the file exists but has a different PNFS-ID then the namespace should ignore the door's deletion request.
In this case, the file was (seemingly) successfully uploaded via WebDAV. That transfer would have overwritten the incomplete file (from the xrootd transfer), so the WebDAV-uploaded file would have the same name but a different PNFS-ID.
Therefore, the failed xrootd transfer should not have deleted the successful WebDAV-uploaded file.
An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.
An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.
That is definitely the case. gfal2 issues the removal, and only then returns. So in principle the removal is done when we start the second upload with https. But here we see it after.
An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.
That's what I thought from the get-go, but @chaen claimed that xrootd and https transfers are sequential.
Or... indeed.... the delete request to xrootd door was stuck in message queue for 5 minutes!?
That's why, if we could get the complete billing records corresponding to:
datestamp | client | path | cellname | action | errormessage
----------------------------+----------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------+---------+---------------------------------------------------------
2024-07-06 08:52:38.584+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | Transfer interrupted
2024-07-06 09:11:02.181+02 | 134.158.51.208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
2024-07-06 09:20:19.91+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
2024-07-06 09:25:32.413+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | webdav-ccdcatli346 | request |
2024-07-06 09:30:16.004+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | remove |
it may tell us something.
OK, here's my best guess at what happened: the client connected to the xrootd door and opened (kXR_open) the target file, the upload then failed, and the client subsequently issued a delete (kXR_delete) on the file, which was only acted upon after the successful WebDAV re-upload. This explanation seems plausible to me, but it lacks direct evidence for the key points --- it's mainly guess work.
I agree with @DmitryLitvintsev, getting the complete billing records for the selected entries may help us gain a better understanding.
The complete billing record was sent in the RT ticket, together with the xrootd door and pool logs. https://rt.dcache.org/Ticket/Display.html?id=10634
What are JTM settings on pools?
--- jtm (Job timeout manager) ---
    xrootd (lastAccess=86400;total=86400)
    webdav (lastAccess=900;total=14400)
    gridftp (lastAccess=900;total=14400)
    regular (lastAccess=0;total=0)
    p2p (lastAccess=0;total=0)
What you describe, @paulmillar, is exactly my gut feeling about what's happening, in particular since I have seen such occurrences at other sites (not dCache), but with SRM, which was asynchronous by nature.
In my opinion the client should not call remove after failure. That should solve this, no?
In my opinion the client should not call remove after failure. That should solve this, no?
Exactly.
This is particularly true, given that the xrootd protocol has the persist-on-successful-close (POSC) flag, which (I think) should give the desired behaviour.
The complete billing record was sent in the RT ticket, together with the xrootd door and pool logs. https://rt.dcache.org/Ticket/Display.html?id=10634
What are JTM settings on pools?
--- jtm (Job timeout manager) ---
    xrootd (lastAccess=86400;total=86400)
    webdav (lastAccess=900;total=14400)
    gridftp (lastAccess=900;total=14400)
    regular (lastAccess=0;total=0)
    p2p (lastAccess=0;total=0)
Thanks. If you do not have a tape backend you can lower the xrootd lastAccess timeout from 24 hours to, say, 900 seconds like the other protocols. This has nothing to do with the issue reported here, though.
I suggest examining whether raising the default:
xrootd.limits.threads = 1000
to a higher number helps. Or run multiple xrootd doors behind HAProxy (we do that at Fermilab).
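For example (a sketch only; the value is just an illustration, and the property would go in the xrootd door's dcache.conf or layout, followed by a restart of the door domain):

# raise the door's request-thread limit above the default of 1000 (illustrative value)
xrootd.limits.threads = 2000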
On xrootd door run "connections" command, count how many connections you have during peaks
Hi, OK, I switched to 900 seconds for the xrootd JTM timeout.
We have 2 xrootd doors behind a load balancer for each VO.
On xrootd door run "connections" command, count how many connections you have during peaks
ssh -q -i ~/.ssh/id_dcache -p 22224 ccdcamcli06 "\s xrootd-ccdcacli434 connections" | wc -l
3193
ssh -q -i ~/.ssh/id_dcache -p 22224 ccdcamcli06 "\s xrootd-ccdcacli424 connections" | wc -l
3238
I suggest examining whether raising the default:
xrootd.limits.threads = 1000
I will try to increase it a bit.
Hi,
Lately, we have put the IN2P3 storage (dCache 9.2.6) under a lot of stress, leading to some instabilities. In practice, some xroot upload attempts from jobs running on worker nodes ended up hitting a timeout. When that happens, the job retries the same upload, using another protocol (https).
In the job upload log (whose timestamps are UTC; converted to CET for easier comparison later on), we see the failed xroot upload attempt and then the retry that finally succeeded with https (the success itself is not in the logs, we don't log successes :-) ).
Uploads are done via the gfal2 python API, in a completely sequential way (no thread, no process). It boils down to a plain gfal2 copy call (sketched below), called once per protocol: so first dest_url is xroot, then it's https if it fails.
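Roughly, something like the following minimal sketch (not the actual DIRAC code; the function name, URL handling and parameter choices here are illustrative, assuming the gfal2 Python bindings):

import gfal2

def upload_with_fallback(local_path, dest_urls):
    # dest_urls is e.g. [xroot_url, https_url] for the same LFN;
    # local_path is assumed to be an absolute path on the worker node
    ctx = gfal2.creat_context()
    params = ctx.transfer_parameters()
    params.overwrite = True          # a retry replaces any partial upload
    params.checksum_check = True
    last_error = None
    for dest_url in dest_urls:       # strictly sequential: no thread, no process
        try:
            ctx.filecopy(params, "file://" + local_path, dest_url)
            return dest_url          # stop at the first protocol that works
        except gfal2.GError as err:
            last_error = err         # remember the failure, try the next protocol
    raise last_error if last_error else RuntimeError("no destination URL given")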
However, the file does not exist on the storage. Asking Adrien (@ageorget) for the dCache logs, this is what he has:
There are multiple things to note:
- What does "Transfer interrupted" mean? And if the transfer is interrupted, why do we still get "No connection from client after 300 s"?
- There is an https request at 09:25:32, which corresponds to what we see (ok, interpret) in the DIRAC logs.
- There is an xrootd removal request at 09:30:16, coming from the very same host. At this stage, the job is already finished (we have the exit timestamp at 09:25:53), and even the pilot job finished its duty (exit timestamp 09:26:01). We do not have explicit removal code, but gfal has a cleanup mechanism which triggers upon a failed upload. But this is absolutely synchronous from the client side, and we would not have received the xroot error (09:25:20) if the delete call had not returned.

How can that asynchronous removal happen? Can it be that the removal request reaches the headnode, but is only executed later?
This is extremely bad, as we are convinced that the file has been successfully written, but in fact it is not. Effectively, this is data loss that has been biting us badly for the past couple of weeks.
Thanks a lot for your help !