dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
291 stars 136 forks source link

Transfer failures with multiple RemoteTransferManagers #6754

Open vingar opened 2 years ago

vingar commented 2 years ago

Hello,

When enabling a second RemoteTransferManagers cell on one instance at BNL. We observed some 5-10% failures in transfers when it was enabled (see red on the plot).

In the billing logs, the errors are related to post transfer validation most likely on the pools like:

Post-processing failed: No such file or directory: .... 

On FTS/client side, these errors appear as:

Transfer failed: failure: RemoteTransferManager restarted.

Let us know if more details is needed.

image

paulmillar commented 2 years ago

The Post-processing failed: No such file or directory message comes from the error-recovery procedure within dCache. When the WebDAV door detects that a PULL request has failed, it deletes the inbound file (in the namespace). This is to prevent "dark data". Once the transfer has finished, the pool checks whether the file still exists in the namespace. The pool then logs this Post-processing failed: No such file or directory if the file is no longer in the namespace and removes the replica.

The RemoteTransferManager restarted message suggests the RTM is indicating that I doesn't know about the transfer, when the door is making a routine query (in order to provide a progress report).

A few questions:

Which version of dCache is the WebDAV door running?

Do these transfer failures stop (i.e., all transfers are successful) if the second RTM is switched off?

Do you see any corresponding entries in the RTM log file (in the log file of the domain hosting the RTM)?

Do you see any patterns in these failures? For example, do they target the same directory (within the namespace)? Use the same RTM? Involve the same pool node? Happen at times when the WebDAV or RTM services are seeing increased load? Correspond to certain times of the day? ... or anything else that is common between failures.

In particular, what is the message queue length in the WebDAV door cell when these failures are taking place?

Cheers, Paul.

vingar commented 2 years ago

A few questions: Which version of dCache is the WebDAV door running?

7.2.21 is deployed uniformly (doors, pools, central nodes.)

Do these transfer failures stop (i.e., all transfers are successful) if the second RTM is switched off?

Yes, these transfer failures stop when the second RTM is switched off (as illustrated by the plot).

Do you see any corresponding entries in the RTM log file (in the log file of the domain hosting the RTM)?

The problem can be reproduced when introducing a new RTM and we will have to do it to answer this question.

Do you see any patterns in these failures?

The errors appeared with the introduction of a new RTM and disappeared when this one is switched off. The load is constant (nb trasnfers over time) during the introduction of a new RTM.

For example, do they target the same directory (within the namespace)?

No, the directories involved in the transfers are different.

Use the same RTM? Involve the same pool node? Happen at times when the WebDAV or RTM services are seeing increased load? Correspond to certain times of the day? ... or anything else that is common between failures.

The errors are spread on different RTMS. The two doors have respectively a pool (pool 1 and pool 2) on the same node with a DMZ replication to another pools node. The errors seem to cover all possibility (RTM 1 - Pool 1, RTM 1 - Pool 2, RTM 2 - Pool 1, RTM 2 - Pool 2). The load is constant with 50-100 concurrent transfers.

In particular, what is the message queue length in the WebDAV door cell when these failures are taking place?

How do we get this information (e.g. info -a)? The doors are not overloaded.

Vincent

paulmillar commented 2 years ago

Thanks @vingar for the info.

Unfortunately, the cell message queue length is a little tricky to get.

It is available from the System cell (of the domain hosting the WebDAV door) using the ps -f command. The queue length is the third column, labelled Queue.

Alternatively, the queue length is available via the (old) web interface's "Cell Services" page, in the column with the title RP.

Cheers, Paul.

vingar commented 2 years ago

Hi,

l-AAXndfG1HlA-<unknown>-AAXndfG5Z_g   A       0         0       2 LocationMgrTunnel    Connected to webdav-dcdndoor04_httpsDomain     
l-AAXndfG1HlA-<unknown>-AAXndfG5a-A   A       0         0       2 LocationMgrTunnel    Connected to webdav-dcdndoor03_httpsDomain     
l-AAXndfG1HlA-<unknown>-AAXndfG5ZBA   A       0         0       2 LocationMgrTunnel    Connected to webdav2-dcdndoor04_httpsDomain    
l-AAXndfG1HlA-<unknown>-AAXndfG4-pg   A       0         0       2 LocationMgrTunnel    Connected to webdav2-dcdndoor03_httpsDomain  

What is the recommendation with multiple RTMs for the transfermanagers.enable.db option ?

Vincent

vingar commented 2 years ago

Logs from RemoteTransferManagers for one file failing: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C

[root@dcdncore01 ~]# grep 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C /var/log/dcache/dcdncore01Domain.log 
    PNFS-ID: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C
30 Aug 2022 10:35:28 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA WebDAV2-dcdndoor04 RemoteTransferManager] PoolMgrSelectPoolMsg: PnfsId=0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C;StorageInfo=size=0;new=true;stored=false;sClass=dunepro:DISKDATA;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/sdcc.bnl.gov/data/dune/RSE/ftstest/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;uid=-1;writeToken=6;gid=-1;StoreName=dunepro;xattr.xdg.origin.url=https://eospublic.cern.ch//eos/experiment/neutplatform/protodune/dune/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;links=0000C336567E6BD741839C5CD1D8F7C336E6 dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;store=dunepro;group=DISKDATA;bfid=<Unknown>;;
30 Aug 2022 10:35:28 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA WebDAV2-dcdndoor04 RemoteTransferManager] poolManagerReply = PnfsId=0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C;StorageInfo=size=0;new=true;stored=false;sClass=dunepro:DISKDATA;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/sdcc.bnl.gov/data/dune/RSE/ftstest/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;uid=-1;writeToken=6;gid=-1;StoreName=dunepro;SpaceTokenDescription=DUNEPRO_DISK;xattr.xdg.origin.url=https://eospublic.cern.ch//eos/experiment/neutplatform/protodune/dune/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;links=0000C336567E6BD741839C5CD1D8F7C336E6 dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5;SpaceToken=6;LinkGroupId=6;store=dunepro;group=DISKDATA;bfid=<Unknown>;;PoolName=dcdndoor03_1 PoolAddress=dcdndoor03_1@dcdndoor03oneDomain
30 Aug 2022 10:35:28 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA WebDAV2-dcdndoor04 RemoteTransferManager] Pool PoolName=dcdndoor03_1 PoolAddress=dcdndoor03_1@dcdndoor03oneDomain will deliver file 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C mover id is 100667670
30 Aug 2022 10:38:02 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA dcdndoor03_1 DoorTransferFinished 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] poolDoorMessageArrived, doorMessage.getReturnCode()=10001
30 Aug 2022 10:38:02 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA dcdndoor03_1 DoorTransferFinished 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] sending error reply 10027:Post-processing failed: No such file or directory: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C for id: 1661870128160000
    PNFS-ID: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C
30 Aug 2022 10:38:02 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA dcdndoor03_1 DoorTransferFinished 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] Sending info: DoorRequestInfoMessage{transactionTime=154528, client='eospublic.cern.ch', clientChain='unknown', transferPath='/pnfs/sdcc.bnl.gov/data/dune/RSE/ftstest/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5'} InfoMessage{cellType='door', messageType='request', cellAddress=RemoteTransferManager@dcdncore01Domain, timeQueued=11, resultCode=10027, message='Post-processing failed: No such file or directory: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C', timestamp=1661870128160, transaction='door:RemoteTransferManager@dcdncore01Domain:1661870128160-677', transactionID=677, subject=Subject:
30 Aug 2022 10:38:02 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA dcdndoor03_1 DoorTransferFinished 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] Cancelling the mover timer for handler id 1661870128160000
30 Aug 2022 10:38:02 (RemoteTransferManager) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA dcdndoor03_1 DoorTransferFinished 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] finishTransfer() num_transfers = 43 DECREMENT

On the pool:

# grep 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C /var/log/dcache/*
/var/log/dcache/dcdndoor03oneDomain.log:30 Aug 2022 10:38:02 (dcdndoor03_1) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA RemoteTransferManager PoolAcceptFile 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] Transfer failed in post-processing: No such file or directory: 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C
/var/log/dcache/dcdndoor03oneDomain.log:30 Aug 2022 10:38:02 (dcdndoor03_1) [door:WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain:AAXndkqmlrA RemoteTransferManager PoolAcceptFile 0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C] Failed to read file size: java.nio.file.NoSuchFileException: /data/dcache_pool_1/pool/data/0000D9D6E3BAF0F04C9C9DF3F9301DC42A0C
paulmillar commented 2 years ago

Two more things:

Could you check the domain log file (for domain webdav2-dcdndoor04_httpsDomain) to see whether the WebDAV door logged anything about this transfer ... or anything else (out of the ordinary) between 10:35:28 and 10:38:02 ?

Could you also look at the access log file for the WebDAV door (WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain) for details of this COPY request?

vingar commented 2 years ago

Could you check the domain log file (for domain webdav2-dcdndoor04_httpsDomain) to see whether the WebDAV door logged anything about this transfer ... or anything else (out of the ordinary) between 10:35:28 and 10:38:02 ?

Nothing special. Messages like parent does not exist:

Could you also look at the access log file for the WebDAV door (WebDAV2-dcdndoor04@webdav2-dcdndoor04_httpsDomain) for details of this COPY request?

level=WARN ts=2022-08-30T10:37:53.252-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcdndoor.sdcc.bnl.gov//pnfs/sdcc.bnl.gov/data/dune/RSE/ftstest/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5 response.code=202 socket.remote=130.199.185.66:34092 user-agent="fts_url_copy/3.11.2 gfal2/2.20.5 neon/0.0.29" user.dn="CN=412520193,CN=746391844,CN=Robot: BNL SDCC Grid bot account,CN=614522,CN=bnlbot,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=100338:31167[9A9GRedo:GpO/5WoS] tpc.credential=none tpc.error="RemoteTransferManager restarted" tpc.require-checksum=false tpc.source=https://eospublic.cern.ch//eos/experiment/neutplatform/protodune/dune/dc4-vd-coldbox-bottom/1c/8f/dc4_np02bde_307100901_np02_bde_coldbox_run012352_0039_20211215T232400.hdf5 duration=145109
paulmillar commented 2 years ago

Reconstructed timeline:

10:35:28  Transfer starts
10:35:33  WebDAV queries RTM for progress.
10:35:38  WebDAV queries RTM for progress.
10:35:43  WebDAV queries RTM for progress.
10:35:48  WebDAV queries RTM for progress.
[...]
10:37:43  WebDAV queries RTM for progress.
10:37:48  WebDAV queries RTM for progress (RTM says no-such-transfer?).
10:37:53  WebDAV queries RTM for progress.; RTM says no-such-transfer; WebDAV aborts transfer.
10:38:02  Transfer completes on pool; pool notices file deleted in namespace
paulmillar commented 2 years ago

My initial thought was that this problem is due to the race-condition, between the RTM notifying the WebDAV door that the transfer has completed and the WebDAV door discovering this for itself (when requesting info for a progress report).

Although the WebDAV aborts the transfer towards the end of the transfer, it doesn't seem to be a result of this race-condition: it happens too early. Moreover, the WebDAV door requires the RTM to claim the transfer doesn't exist twice before cancelling the transfer.

Therefore, I think the more likely explanation is that, for whatever reason, the WebDAV door looses track of which RTM is handling the transfer. Given the code, I can't explain how this could happen.

@vingar , could you check whether the WebDAV door accepted any new transfers between 10:35:28 and 10:38:02? This is slightly tricky to do, as the access log file records when transfers finish, not when they start. For COPY requests, the duration field describes how long a transfer took (in milliseconds). Subtracting this value from the timestamp gives the time when the transfer started. You could also look at the two RTMs, for log entries that contain PoolMgrSelectPoolMsg. These log entries should provide timestamps when a WebDAV door accepted a new transfer.

Meanwhile, I will update the code to provide some additional information when failing a transfer and see if there is anything else that could be improved.

vingar commented 2 years ago

Yes, from the billings DBs and timestamps other successful transfers are ongoing on the same RTMs at the same time.

Another observation, with enable.db=true for RTMsn there are SQL errors with duplicate key value violates unique constraint "REMOTETRANSFERS_B_PK"/.

paulmillar commented 2 years ago

The concurrency is interesting.

A more specific question: did any transfer (that used this WebDAV door) start or end while this transfer was ongoing; i.e., between 10:35:28 and 10:38:02 ?

Another question: do the problems correlate with enabling the DB; i.e., with enable.db=true in the RTM?

vingar commented 2 years ago

A more specific question: did any transfer (that used this WebDAV door) start or end while this transfer was ongoing; i.e., between 10:35:28 and 10:38:02

See plot for transfer involving RTM-pool. The door is not mentioned in the billing db but I would say most likely yes. In other situation the load was sustained and both dav doors were actively serving other transfers with some failures

image

Another question: do the problems correlate with enabling the DB; i.e., with enable.db=true in the RTM?

The problems doesn't correlate with enabling the DB. The failures are still there with both RTM with enable.db set to False