dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
https://dcache.org

File lost due to late removal #7613

Open chaen opened 1 month ago

chaen commented 1 month ago

Hi,

Lately, we have put the IN2P3 storage (dCache 9.2.6) under a lot of stress, leading to some instabilities. In practice, some xroot upload attempts from jobs running on worker nodes ended up hitting a timeout. When that happens, the job retries the same upload, using another protocol (https).

An example job upload log would be the following:

2024-07-06 06:45:41 UTC dirac-jobexec/FailoverTransfer INFO: Attempting dm.putAndRegister ('/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst','/tmp/902787232/00220520_00166463_1.qee.dst','IN2P3-DST',guid='D3A64FBE-3B5A-11EF-9F49-BC97E1E91A30',catalog='None', checksum = '7795920d')
2024-07-06 07:25:20 UTC dirac-jobexec/DIRAC.Resources.Storage.StorageElement/SE[IN2P3-DST] VERBOSE: Failure in plugin to perform putFile Plugin: GFAL2_XROOT lfn: /lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst error Failed to copy /tmp/902787232/00220520_00166463_1.qee.dst to root://proxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst: GError('Error on XrdCl::CopyProcess::Run(): [ERROR] Operation expired:  (destination)', 116)
2024-07-06 07:25:33 UTC dirac-jobexec/FailoverTransfer VERBOSE: {'OK': True, 'Value': {'Successful': {'/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst': {'put': 2390.769085407257, 'register': 0.07753562927246094}}, 'Failed': {}}}

These are UTC times. If I convert to CET for easier comparison later on, we have:

Uploads are done via the gfal2 Python API, in a completely sequential way (no threads, no subprocesses). It boils down to plain:

import gfal2

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.create_parent = True
params.timeout = large_number   # a very generous transfer timeout
params.nbstreams = 4
params.overwrite = True
params.set_checksum(gfal2.checksum_mode.both, "Adler32", "")
ctx.filecopy(params, local_file, dest_url)

This is called once per protocol: dest_url is first the xroot URL, then the https URL if the xroot attempt fails (see the sketch below).
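For illustration, a minimal sketch of that failover behaviour (the function and variable names are illustrative, not the actual DIRAC code):

import gfal2

def upload_with_failover(local_file, dest_urls):
    """Try each destination URL in turn, e.g. the xroot URL first, then the https one."""
    ctx = gfal2.creat_context()
    last_error = None
    for dest_url in dest_urls:
        params = ctx.transfer_parameters()
        params.create_parent = True
        params.overwrite = True
        params.set_checksum(gfal2.checksum_mode.both, "Adler32", "")
        try:
            ctx.filecopy(params, local_file, dest_url)
            return dest_url              # stop at the first protocol that succeeds
        except gfal2.GError as exc:
            last_error = exc             # fall through and retry with the next protocol
    raise last_error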

However, the file does not exist on the storage. Asking Adrien (@ageorget) for the dCache logs, this is what he sees:

       datestamp          |              client              |                                                    path                                                    |                  cellname                   | action  |                      errormessage                       
----------------------------+----------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------+---------+---------------------------------------------------------
 2024-07-06 08:52:38.584+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | Transfer interrupted
 2024-07-06 09:11:02.181+02 | 134.158.51.208                   | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
 2024-07-06 09:20:19.91+02  | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
 2024-07-06 09:25:32.413+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | webdav-ccdcatli346 | request | 
 2024-07-06 09:30:16.004+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | remove  | 

There are multiple things to note:

How can that asynchronous removal happen? Can it be that the removal request reaches the head node, but is only executed later?

This is extremely bad: we are convinced that the file has been successfully written, when in fact it has not been. Effectively, this is data loss, and it has been biting us badly for the past couple of weeks.

Thanks a lot for your help !

paulmillar commented 1 month ago

Hi @chaen,

If I may suggest, please urgently ask Adrien to keep this xrootd door's access log file on 2024-07-06, from 08:52:38.584+02 to 09:30:16.004+02.

The access log files are often rotated (and deleted) relatively quickly, so being prompt with this request would be wise.

Cheers, Paul.

ageorget commented 1 month ago

Hi, I checked the xrootd access logs (attached), but I couldn't find anything relevant except "no such file" messages. I think the default log level is not verbose enough to get details about those requests. xrootd-ccdcacli434Domain.access.2024-07-06.gz

On the dCache side, the xrootd queues on the pools were full of active requests, with hundreds of xrootd transfers queued. I don't know how the transfer timeout is managed on the client side, but I saw a lot of "Transfer interrupted" and "No connection from client after 300 seconds. Giving up." errors.

Adrien

paulmillar commented 1 month ago

The worrying part is, as @ageorget says, there's no mention in the xrootd door's access log file for either the failed upload attempt (at ~08:52:38) or the problematic delete (at ~09:30:16).

The filename 00220520_00166463_1.qee.dst only appears as 13 failed stat requests, from 09:40:33 to 09:56:18.

The access log file is recording at INFO level (which is good), so it should capture all successful and failed client requests. I therefore do not understand why billing is reporting door activity for which the door records no corresponding client activity.

Does the domain log file (for the domain hosting the xrootd door) indicate any errors/problems at around this time?

ageorget commented 1 month ago

In the XRootD door log file, I only found some SSL handshake timeouts, but not at the exact timestamps of the upload or delete requests:

Jul 06 09:04:14 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 09:04:14 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [door:xrootd-ccdcacli434@xrootd-ccdcacli434Domain:AAYcjpu3+Pg] TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.

Another suspicious pattern occurs on many files (see the attached screenshot): the file is first read successfully with webdav, and then fails 17 times with xrootd at the same time, with the "No connection from client after 300 seconds. Giving up." error.

DmitryLitvintsev commented 1 month ago

"No connection from client after 300 seconds" usually means that the client did not connect to the mover within 5 minutes. For "Transfer interrupted", look for log entries on that pool corresponding to that transfer.

A transfer is identified by its session-id string, which looks like "door:Xrootd01-fndcadoor01@xrootd01-fndcadoor01Domain:AAYc3hPdkMg" (for instance).
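For instance, something along these lines on the pool node (the log path and domain name are assumptions; adjust to the local installation):

grep 'AAYc3hPdkMg' /var/log/dcache/<poolDomain>.log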

To me it looks like the client issues the remove command, TBH. It would be great to have the client-side log (of the failure, of course; I would like to see whether they log the removal).

DmitryLitvintsev commented 1 month ago

That table output from @ageorget looks like a query of the billing DB; can it include the initiator and transaction? Essentially, just do "select *". Maybe open an RT ticket and upload it there (together with the access file).
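Assuming the table behind that output is the standard billinginfo table, a hedged sketch of such a query would be:

SELECT *
  FROM billinginfo
 WHERE path LIKE '%00220520_00166463_1.qee.dst'
 ORDER BY datestamp;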

chaen commented 1 month ago

In the XRootD door log file, I only found some SSL handshake timeouts, but not at the exact timestamps of the upload or delete requests:

Jul 06 09:04:14 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 09:04:14 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [door:xrootd-ccdcacli434@xrootd-ccdcacli434Domain:AAYcjpu3+Pg] TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.
Jul 06 08:51:48 ccdcacli434 dcache@xrootd-ccdcacli434Domain[7166]: 06 Jul 2024 08:51:48 (xrootd-ccdcacli434) [] Server: TLS handshake failed: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms.

Another suspicious pattern occurs on many files (see the attached screenshot): the file is first read successfully with webdav, and then fails 17 times with xrootd at the same time, with the "No connection from client after 300 seconds. Giving up." error.

Regarding the read pattern, there is a job running on that file which tries to download the file locally:

2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData INFO: Preliminary checks OK : now downloading /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER
2024-07-10 14:29:02 UTC None/[904754301]DownloadInputData VERBOSE: Attempting to download file /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst from IN2P3-BUFFER:
Trying to download root://proxy@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
1 records inserted to MonitoringDB
Failure in plugin to perform getFile Plugin: GFAL2_XROOT lfn: /lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst error GError('Error on XrdCl::CopyProcess::Run(): [FATAL] Redirect limit has been reached:  (source)', 40)
Trying to download https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/buffer/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/00231199_00003326_7.RXcHad.Strip.dst to /tmp/904754301/00231199_00003326_7.RXcHad.Strip.dst
2024-07-10 14:29:03 UTC None/[904754301]DownloadInputData VERBOSE: File successfully downloaded locally (/lhcb/MC/2018/RXCHAD.STRIP.DST/00231199/0000/

It fails with xrootd, and then succeeds with https. The timestamp matches the http read. What is interesting is that the failed xroot entries on the dCache side are timestamped exactly 300 seconds later, just as if things had kept going on the server side, while the client had definitely received an error and moved on.

chaen commented 1 month ago

To me it looks like client issues remove command TBH. It would be great to have client side log (of failure of course, I would like to see if they log the removal).

I am afraid it is not logged at the standard level. Running in debug mode isn't an option either, as it's way too verbose :grimacing:

DmitryLitvintsev commented 1 month ago

As a practical step I would consider changing

pool.mover.xrootd.timeout.connect = 300

to something like

pool.mover.xrootd.timeout.connect = 5

This needs to be done on the pools. I would try one.
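For reference, a hedged sketch of where such a property could go; the domain and pool names below are placeholders, and the setting can also be put in dcache.conf to apply to every pool:

[poolsDomain]
[poolsDomain/pool]
name = pool-lhcb-01
path = /srv/dcache/pool-lhcb-01
pool.mover.xrootd.timeout.connect = 5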

ageorget commented 1 month ago

That table output from @ageorget looks like a query of the billing DB; can it include the initiator and transaction? Essentially, just do "select *". Maybe open an RT ticket and upload it there (together with the access file).

I just opened an RT ticket with additional logs (xrootd/pool domain/access logs).

As a practical step I would consider changing

pool.mover.xrootd.timeout.connect = 300

to something like

pool.mover.xrootd.timeout.connect = 5

This needs to be done on the pools. I would try one.

OK, I did it on one pool and restarted it. Should I check for anything in particular then?

ageorget commented 1 month ago

Checking the movers on a pool, I see all these duplicated (x17) xrootd movers transferring nothing (bytes=0) until they reach the timeout. moverinfo_lhcb.txt

DmitryLitvintsev commented 1 month ago

I think these are reads in a similar situation to your writes: the mover started, but the client never connected, and the door either did not get any control command from the client or did not understand what the client communicated.

What are the JTM settings on the pools? (Just attach "info -a" from a pool, as sketched below; I will figure it out.) NB: I am currently looking for a palliative "fix"; ultimately I need to understand why these failures happened in the first place (I refer to the WRITEs).

Returning to the original issue: in principle, if dCache thinks a transfer was incomplete, it tries to remove the file; it just does so after a long timeout.
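One way to grab that "info -a" output is via the admin ssh interface; the admin host, port, and pool name below are placeholders:

ssh -p 22224 <admin-node> "\s <poolName> info -a"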

DmitryLitvintsev commented 1 month ago

That table output from @ageorget looks like a query of the billing DB; can it include the initiator and transaction? Essentially, just do "select *". Maybe open an RT ticket and upload it there (together with the access file).

I just opened an RT ticket with additional logs (xrootd/pool domain/access logs).

As a practical step I would consider changing

pool.mover.xrootd.timeout.connect = 300

to something like

pool.mover.xrootd.timeout.connect = 5

This needs to be done on the pools. I would try one.

OK, I did it on one pool and restarted it. Should I check for anything in particular then?

Just one thing: do you see more errors on that pool, or is the number of errors the same as before (errors like "No connection from client after")?

If you do not see an increase in these, then I would use this setting on all your pools.

And if this helps with the file deletion, we have breathing room to address the root cause, or at least understand it better. My current thinking is that dCache and the client fail to agree on whether the transfer failed or not: the client thinks it failed and bails out, whereas dCache thinks the transfer is "in progress" and keeps waiting for the client to connect to the pool. Shortening that wait time from 5 minutes to 5 seconds might alleviate your problem, but ultimately we need to understand whether we are missing something and not shutting down the mover properly in the client/door interaction. At least this is my conjecture for now.

paulmillar commented 1 month ago

Returning to the original issue: in principle, if dCache thinks a transfer was incomplete, it tries to remove the file; it just does so after a long timeout.

The xrootd door should only remove a failed upload if the door knows the file's PNFS-ID, which it obtains when the transfer was started. It also includes this PNFS-ID when sending a message to the namespace requesting the entry is deleted.

The namespace should only remove the file if it exists and has the expected PNFS-ID. If the file exists but has a different PNFS-ID then the namespace should ignore the door's deletion request.

In this case, the file was (seemingly) successfully uploaded via WebDAV. That transfer would have overwritten the incomplete file (from the xrootd transfer), so the WebDAV-uploaded file would have the same name but a different PNFS-ID.

Therefore, the failed xrootd transfer should not have deleted the successful WebDAV-uploaded file.
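To make that guard concrete, here is a rough, purely illustrative sketch of the check described above; this is not dCache source code, and the namespace API used here is invented:

def remove_failed_upload(namespace, path, pnfsid_from_door):
    # The door passes along the PNFS-ID it obtained when the upload started.
    entry = namespace.lookup(path)        # hypothetical namespace call
    if entry is None:
        return                            # nothing left to delete
    if entry.pnfsid != pnfsid_from_door:
        return                            # a different file (e.g. the later WebDAV upload) now owns this name
    namespace.delete(path, entry.pnfsid)  # same upload: safe to remove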

An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.

chaen commented 1 month ago

An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.

That is definitely the case. gfal2 issues the removal and only then returns, so in principle the removal is done before we start the second upload with https. But here we see it happen afterwards.

DmitryLitvintsev commented 1 month ago

An alternative explanation is that the client connected to the xrootd door and issued a delete operation against the file, perhaps as a recovery procedure for the failed upload.

That's what I thought from the get-go, but @chaen claimed that the xrootd and https transfers are sequential.
Or... indeed... was the delete request to the xrootd door stuck in a message queue for 5 minutes!? That's why I would like to get the complete billing records corresponding to:

       datestamp          |              client              |                                                    path                                                    |                  cellname                   | action  |                      errormessage                       
----------------------------+----------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------+---------+---------------------------------------------------------
 2024-07-06 08:52:38.584+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | Transfer interrupted
 2024-07-06 09:11:02.181+02 | 134.158.51.208                   | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
 2024-07-06 09:20:19.91+02  | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | request | No connection from client after 300 seconds. Giving up.
 2024-07-06 09:25:32.413+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | webdav-ccdcatli346 | request | 
 2024-07-06 09:30:16.004+02 | 2001:660:5009:312:134:158:51:208 | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/LHCb/Collision24/QEE.DST/00220520/0016/00220520_00166463_1.qee.dst | xrootd-ccdcacli434 | remove  | 

They may tell us something.

paulmillar commented 1 month ago

OK, here's my best guess at what happened.

  1. the xrootd door was in an overload situation, in particular all (or almost all) of the threads in the network-handling thread pool are blocked. This is most likely due to transfers queuing on pools.
  2. gfal attempted an xrootd upload by opening (kXR_open) the target file.
  3. for whatever reason (no spare thread in the threadpool?) this request took a long time to process.
  4. The client decides the open request is taking too long and times out, disconnecting from dCache. Importantly, disconnecting doesn't cancel the request. It will still be processed, but who knows when.
  5. gfal attempts a recovery procedure by issuing an xrootd delete (kXR_rm) on the file.
  6. For whatever reason (no available thread to process the request?), the response to this delete request takes a long time.
  7. The client decides the delete operation is taking too long and times out, disconnecting from dCache. Importantly, disconnecting doesn't cancel the request. It will still be processed, but who knows when.
  8. gfal now attempts to upload the file with WebDAV, which is successful.
  9. after five minutes, the pool gives up waiting for the client to connect. The pool returns an error to the door.
  10. the xrootd door thread (waiting for this message from the pool) processed the response. It would send a response to the client, but the client has already gone. The thread then becomes available to process another request.
  11. the previous delete request (from step 5) is finally enacted, deleting the WebDAV upload.

This explanation seems plausible to me, but it lacks direct evidence for the key points; it's mainly guesswork.

I agree with @DmitryLitvintsev, getting the complete billing records for the selected entries may help us gain a better understanding.

ageorget commented 1 month ago

The complete billing record was sent in the RT ticket, together with logs from the xrootd door and the pools. https://rt.dcache.org/Ticket/Display.html?id=10634

What are JTM settings on pools?

--- jtm (Job timeout manager) ---
xrootd (lastAccess=86400;total=86400)
webdav (lastAccess=900;total=14400)
gridftp (lastAccess=900;total=14400)
regular (lastAccess=0;total=0)
p2p (lastAccess=0;total=0)

chaen commented 1 month ago

What you describe, @paulmillar, is exactly my gut feeling about what's happening, in particular since I have seen such occurrences at other sites (not dCache) with SRM, which was asynchronous by nature.

DmitryLitvintsev commented 1 month ago

In my opinion the client should not call remove after failure. That should solve this, no?

paulmillar commented 1 month ago

In my opinion the client should not call remove after failure. That should solve this, no?

Exactly.

This is particularly true, given that the xrootd protocol has the persist-on-successful-close (POSC) flag, which (I think) should give the desired behaviour.

DmitryLitvintsev commented 1 month ago

The complete billing record was sent in the RT ticket, together with logs from the xrootd door and the pools. https://rt.dcache.org/Ticket/Display.html?id=10634

What are JTM settings on pools?

--- jtm (Job timeout manager) ---
  xrootd (lastAccess=86400;total=86400)
  webdav (lastAccess=900;total=14400)
  gridftp (lastAccess=900;total=14400)
  regular (lastAccess=0;total=0)
  p2p (lastAccess=0;total=0)

Thanks. If you do not have a tape backend, you can lower the xrootd lastAccess timeout from 24 hours to, say, 900 seconds, like the other protocols. This has nothing to do with the issue reported here, though.
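If memory serves, that can be changed per pool from the admin interface with something like the following (please verify the exact syntax against the pool's jtm help before relying on it):

jtm set timeout -queue=xrootd -lastAccess=900 -total=14400
save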

I suggest examining whether raising the default:

xrootd.limits.threads = 1000

to a higher number helps, or running multiple xrootd doors behind HAProxy (we do that at Fermilab).
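For reference, a hedged sketch of such a change in the door's layout file (the domain and cell names are placeholders, and the value is only an example):

[xrootdDomain]
[xrootdDomain/xrootd]
xrootd.limits.threads = 2000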

On the xrootd door, run the "connections" command and count how many connections you have during peaks.

ageorget commented 1 month ago

Hi, OK, I switched the JTM xrootd timeout to 900 seconds.

We have 2 xrootd doors behind a load balancer for each VO.

On the xrootd door, run the "connections" command and count how many connections you have during peaks.

ssh -q -i ~/.ssh/id_dcache -p 22224 ccdcamcli06 "\s xrootd-ccdcacli434 connections" | wc -l
3193
ssh -q -i ~/.ssh/id_dcache -p 22224 ccdcamcli06 "\s xrootd-ccdcacli424 connections" | wc -l
3238

I suggest examining whether raising the default:

xrootd.limits.threads = 1000

I will try to increase it a bit.