dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
277 stars 132 forks source link

RemoteTransferManager transfer verification #7548

Closed cfgamboa closed 2 months ago

cfgamboa commented 2 months ago

Hello,

We are observing TPC transfers via FTS failing writing to BNL SE failing with FTS extract

[5] TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Connection terminated abruptly; Status of TPC request unknown

At dCache we observed this enabling the RTM in debug mode


  PNFS-ID: 000022AAD2089F3049B99ADBA9B368E49FFA
12 Apr 2024 10:16:33 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg WebDAV2-dcdoor23-externalipv6 RemoteTransferManager] PoolMgrSelectPoolMsg: PnfsId=000022AAD2089F3049B99ADBA9B368E49FFA;StorageInfo=size=0;new=true;stored=false;sClass=MCTAPE:MC;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1;uid=-1;writeToken=10005;gid=-1;StoreName=MCTAPE;xattr.xdg.origin.url=https://webdav.mwt2.org:2881/atlasdatadisk/rucio/mc16_13TeV/01/b6/AOD.37230856._000004.pool.root.1?copy_mode=pull;links=000083AF845CAB754879AAF385FD6FA3E7B8 AOD.37230856._000004.pool.root.1;store=MCTAPE;group=MC;bfid=<Unknown>;;
12 Apr 2024 10:16:33 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg WebDAV2-dcdoor23-externalipv6 RemoteTransferManager] poolManagerReply = PnfsId=000022AAD2089F3049B99ADBA9B368E49FFA;StorageInfo=size=0;new=true;stored=false;sClass=MCTAPE:MC;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1;uid=-1;writeToken=10005;gid=-1;StoreName=MCTAPE;SpaceTokenDescription=ATLASMCTAPE;xattr.xdg.origin.url=https://webdav.mwt2.org:2881/atlasdatadisk/rucio/mc16_13TeV/01/b6/AOD.37230856._000004.pool.root.1?copy_mode=pull;links=000083AF845CAB754879AAF385FD6FA3E7B8 AOD.37230856._000004.pool.root.1;SpaceToken=10005;LinkGroupId=2;store=MCTAPE;group=MC;bfid=<Unknown>;;PoolName=dcdoor35_1 PoolAddress=dcdoor35_1@dcdoor35oneDomain
12 Apr 2024 10:16:33 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg WebDAV2-dcdoor23-externalipv6 RemoteTransferManager] Pool PoolName=dcdoor35_1 PoolAddress=dcdoor35_1@dcdoor35oneDomain will deliver file 000022AAD2089F3049B99ADBA9B368E49FFA mover id is 100683981
12 Apr 2024 10:28:42 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg dcdoor35_1 DoorTransferFinished 000022AAD2089F3049B99ADBA9B368E49FFA] poolDoorMessageArrived, doorMessage.getReturnCode()=10001
12 Apr 2024 10:28:42 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg dcdoor35_1 DoorTransferFinished 000022AAD2089F3049B99ADBA9B368E49FFA] sending error reply 10027:Post-processing failed: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA for id: 1712931393679000

Billing DB log

04.12 10:21:34 [door:WebDAV2-dcdoor35-externalipv6@webdav2-dcdoor35_httpsDomain:remove] ["usatlas1":6435:31152:unknown] [000022AAD2089F3049B99ADBA9B368E49FFA,0] [/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1] <Unknown> 0 0 {0:""}
04.12 10:28:42 [pool:dcdoor35_1@dcdoor35oneDomain:remove] [000022AAD2089F3049B99ADBA9B368E49FFA,10939842219] [Unknown] MCTAPE:MC@osm {0:"Transfer failed"}
04.12 10:28:42 [pool:dcdoor35_1:transfer] [000022AAD2089F3049B99ADBA9B368E49FFA,0] [/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1] MCTAPE:MC@osm 10939842219 728803 true {RemoteHttpsDataTransfer-1.1:https://webdav.mwt2.org:2881/atlasdatadisk/rucio/mc16_13TeV/01/b6/AOD.37230856._000004.pool.root.1?copy_mode=pull} [door:RemoteTransferManager@dccore03Domain:1712931393679-12115231] {10001:"Post-processing failed: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA"}
04.12 10:16:33 [door:RemoteTransferManager@dccore03Domain:request] ["usatlas1":6435:31152:unknown] [,0] [/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1] MCTAPE:MC@osm 729284 22 {10027:"Post-processing failed: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA"}
12 Apr 2024 10:28:42 (dcdoor35_1) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg RemoteTransferManager PoolAcceptFile 000022AAD2089F3049B99ADBA9B368E49FFA] Transfer failed in post-processing: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA
12 Apr 2024 10:28:42 (dcdoor35_1) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg RemoteTransferManager PoolAcceptFile 000022AAD2089F3049B99ADBA9B368E49FFA] Failed to read file size: java.nio.file.NoSuchFileException: /data/dcache_pool_1/pool/data/000022AAD2089F3049B99ADBA9B368E49FFA

The door show

12 Apr 2024 10:28:42 (dcdoor35_1) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg RemoteTransferManager PoolAcceptFile 000022AAD2089F3049B99ADBA9B368E49FFA] Transfer failed in post-processing: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA
12 Apr 2024 10:28:42 (dcdoor35_1) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg RemoteTransferManager PoolAcceptFile 000022AAD2089F3049B99ADBA9B368E49FFA] Failed to read file size: java.nio.file.NoSuchFileException: /data/dcache_pool_1/pool/data/000022AAD2089F3049B99ADBA9B368E49FFA

I do not understand the time gap for RTM events, it is logged 10 min after.

See this extract:

12 Apr 2024 10:16:33 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg WebDAV2-dcdoor23-externalipv6 RemoteTransferManager] Pool PoolName=dcdoor35_1 PoolAddress=dcdoor35_1@dcdoor35oneDomain will deliver file 000022AAD2089F3049B99ADBA9B368E49FFA mover id is 100683981
12 Apr 2024 10:28:42 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg dcdoor35_1 DoorTransferFinished 000022AAD2089F3049B99ADBA9B368E49FFA] poolDoorMessageArrived, doorMessage.getReturnCode()=10001
12 Apr 2024 10:28:42 (RemoteTransferManager) [door:WebDAV2-dcdoor23-externalipv6@webdav2-dcdoor23_httpsDomain:AAYV5us+Ifg dcdoor35_1 DoorTransferFinished 000022AAD2089F3049B99ADBA9B368E49FFA] sending error reply 10027:Post-processing failed: No such file or directory: 000022AAD2089F3049B99ADBA9B368E49FFA for id: 1712931393679000

Any ideas?

All the best, Carlos

vokac commented 2 months ago

Full HTTP-TPC details from client side (FTS/gfal2)

INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: Hop: https://dcdoor-tape.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: > COPY /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
> User-Agent: fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29
> TE: trailers
> Host: dcdoor-tape.usatlas.bnl.gov
> Source: https://webdav.mwt2.org:2881/atlasdatadisk/rucio/mc16_13TeV/01/b6/AOD.37230856._000004.pool.root.1?copy_mode=pull
> X-Number-Of-Streams: 1
> Secure-Redirection: 1
> ClientInfo: job-id=d9b8ce3e-f8d6-11ee-9956-00163e1051a1;file-id=2017106869;retry=0
> TransferHeaderAuthorization:  SECRET1
> Credential: none
> Authorization: SECRET2
> RequireChecksumVerification: false
> TransferMetadata: eyJjb2xsb2NhdGlvbl9oaW50cyI6IHt9fQ==
> SciTag: 132
> 

INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < HTTP/1.1 202 Accepted
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Date: Fri, 12 Apr 2024 14:16:33 GMT
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-Content-Type-Options: nosniff
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-Frame-Options: SAMEORIGIN
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-XSS-Protection: 1; mode=block
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Server: dCache/9.2.6
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Content-Security-Policy: default-src 'none' ; img-src 'self' data: ; style-src 'self' 'unsafe-inline' ; script-src 'self'; font-src 'self'
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Referrer-Policy: strict-origin-when-cross-origin
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Content-Type: text/perf-marker-stream
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Transfer-Encoding: chunked
INFO    Fri, 12 Apr 2024 10:16:33 -0400; Davix: < 
INFO    Fri, 12 Apr 2024 10:21:33 -0400; Gfal2: Copy failed with mode 3rd pull: Connection terminated abruptly; Status of TPC request unknown

This is followed by PROPFIND

INFO    Fri, 12 Apr 2024 10:21:34 -0400; Davix: > PROPFIND /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
...
INFO    Fri, 12 Apr 2024 10:21:34 -0400; Davix: < HTTP/1.1 207 Multi-status

and DELETE

INFO    Fri, 12 Apr 2024 10:21:34 -0400; Davix: > DELETE /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
...
INFO    Fri, 12 Apr 2024 10:21:34 -0400; Davix: < HTTP/1.1 204 No Content
cfgamboa commented 2 months ago

Hello,

I was able to issue an FTS transfer from CERN to BNL here the job, https://fts.usatlas.bnl.gov:8449/fts3/ftsmon/#/job/ae540026-fac8-11ee-8775-00163e1051a5

The transfer fails with the same error.

I was able to capture the information at the door and pool handling the transfers

At pool managing the TPC request:

14 Apr 2024 21:42:05 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:15 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:25 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:35 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:45 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:55 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:05 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:15 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:25 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:35 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:45 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:55 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] http-outgoing-65053: Close connection
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Connection discarded
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Connection released: [id: 65053][route: {}->http://st-096-hh151c66.cern.ch:8443][total available: 15; route allocated: 0 of 200; total allocated: 85 of 200]
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Adjusting allocation: allocated: 16066304309, file size: 16066304309
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] remove entry 00000BF9909B3C864DD089A84BC286A5D2B1: Transfer failed
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_CLOSE_WRITE event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] cleaning up 00000BF9909B3C864DD089A84BC286A5D2B1
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Transfer failed in post-processing: No such file or directory: 00000BF9909B3C864DD089A84BC286A5D2B1
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Failed to read file size: java.nio.file.NoSuchFileException: /data/dcache_pool_1/pool/data/00000BF9909B3C864DD089A84BC286A5D2B1

The door

[root@dcdoor31 ~]# cat /var/log/dcache/webdav2-dcdoor31_httpsDomain.access|grep gamboa
level=WARN ts=2024-04-14T21:37:15.213-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=404 response.reason="Not Found" socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=7
level=WARN ts=2024-04-14T21:37:15.232-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=404 response.reason="Not Found" socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=7
level=INFO ts=2024-04-14T21:37:15.256-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa response.code=207 response.reason=Multi-status socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=15
level=WARN ts=2024-04-14T21:44:03.465-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=202 socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] tpc.credential=none tpc.error="Post-processing failed: No such file or directory: 00000BF9909B3C864DD089A84BC286A5D2B1" tpc.require-checksum=false tpc.source=https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull duration=408198

The DAV door is still interacting with the transfer, it seemed that the door could not get back to the RTM.

Failed to deliver TransferStatusQueryMessage message.

This is probably related to https://github.com/dCache/dcache/issues/7519 ?

From the transfer manager

RemoteTransferManager@dccore02Domain:
    id: 1713145035278000
        Source: https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull
        Destination: /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4
        State: Mover created
        User: cgamboa
        Restriction: CompositeRestriction[PrefixRestrict[prefix=/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4], Restrict[DOWNLOAD,UPDATE_METADATA]]
        PNFS-ID: 00000BF9909B3C864DD089A84BC286A5D2B1
        Access latency: ONLINE
        Retention policy: REPLICA
        Storage class: cgamboa:USERS
        Pool: PoolName=dcdoor29_1 PoolAddress=dcdoor29_1@dcdoor29oneDomain
        Mover: 100717339
RemoteTransferManager@dccore03Domain:
    ID not found : 1713145035278000

Tables reflect the proper state

        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID  | PNFSID |                     LOCALPATH                      |                                                                                  REMOTEURL                                
                                                   | STATE | WRITE 
------------------+---------+---------------+--------------+----------+-----------+--------+----------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------+-------+-------
 1713145035278000 | f       | 1713145035278 |            0 |        0 | 100717339 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.me
rge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    10 | t
(1 row)

When the transfer fails the request is moved to the backup

        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID  | PNFSID |                                                               LOCALPATH                                                                |  POOLNAME  |                          
                                                        REMOTEURL                                                                                   | STATE | WRITE 
------------------+---------+---------------+--------------+----------+-----------+--------+----------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------+-------+-------

443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
 1713145035278000 | f       | 1713145035278 |            0 |        0 | 100717339 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4  
        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID  | PNFSID |                     LOCALPATH                      |  POOLNAME  |                                                                                  REMOTEURL                   
                                                                | STATE | WRITE 
------------------+---------+---------------+--------------+----------+-----------+--------+----------------------------------------------------+------------+--------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------+-------+-------
 1713145035278000 | f       | 1713145035278 |            0 |        0 | 100717339 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 | dcdoor29_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.expre
ss_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
(1 row)

After restarting the door seemed to alleviate this.

Carlos

vingar commented 2 months ago

It seems that restarting the doors may have caused an inconsistency between the RemoteTransferManager's entries (\s RemoteTransferManager@* ls external) and the corresponding entries in the "REMOTETRANSFERS" table. Some entries in the database are no longer valid transfers in the RemoteTransferManager.

lemora commented 2 months ago

Hey. You appear to run multiple transfermanagers in HA mode; do they share the same database, as recommended? This looks to me like one RTM knows about the transfer, one does not, maybe due to different states/databases? If the wrong one is asked, it does not know the ID:

RemoteTransferManager@dccore02Domain:
    id: 1713145035278000
    [...]
RemoteTransferManager@dccore03Domain:
    ID not found : 1713145035278000
vingar commented 2 months ago

Hi,

three transfermanagers run and each of them have a dedicated database. This setting was recommended by Dmitry in [www.dcache.org #10506]

Please make sure that each RemoteTransferManager instance has its own database associated. (the database can be hosted by the same host, just create multiple of them

createdb transfermanagers1 createdb transfermanagers2

and then adjust

transfermanagers.db.name

accordingly for each RemoteTransferManager cell.

ps: RemoteTransferManager is missing here https://www.dcache.org/manuals/Book-9.2/cookbook-ha-with-replicable-services.shtml

cfgamboa commented 2 months ago

By all means not to deviate the main topic here, but what would be the recommended procedure to maintain REMOTETRANFERS_B tables seemed to have the history of the REMOTETRASFERS requests?

        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID  | PNFSID |                                                               LOCALPATH                                                                
|  POOLNAME  |                                                                                  REMOTEURL                                                                                   | STATE | WRITE 
------------------+---------+---------------+--------------+----------+-----------+--------+----------------------------------------------------------------------------------------------------------------------------------------
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-------
 1713139647002000 | f       | 1713139647002 |            0 |        0 | 100719601 |        | /pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1 
| dcdoor34_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
 1713145035278000 | f       | 1713145035278 |            0 |        0 | 100717339 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4                                                                                     
| dcdoor29_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
 1713146514961000 | f       | 1713146514961 |            0 |        0 | 100716133 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.6                                                                                     
| dcdoor25_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -2 | t
 1713142071348000 | f       | 1713142071348 |            0 |        0 | 100715611 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.2                                                                                     
| dcdoor21_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
transfermanagers-dccore02=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13701128
(1 row)

transfermanagers-dccore02=# \c transfermanagers-dccore01
You are now connected to database "transfermanagers-dccore01" as user "postgres".
transfermanagers-dccore01=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13710601
(1 row)

transfermanagers-dccore01=# \c transfermanagers-dccore03
You are now connected to database "transfermanagers-dccore03" as user "postgres".
transfermanagers-dccore03=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13708792
(1 row)

Does this information is needed by the system to clean this table via dCache?

DmitryLitvintsev commented 2 months ago

By all means not to deviate the main topic here, but what would be the recommended procedure to maintain REMOTETRANFERS_B tables seemed to have the history of the REMOTETRASFERS requests?

        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID  | PNFSID |                                                               LOCALPATH                                                                
|  POOLNAME  |                                                                                  REMOTEURL                                                                                   | STATE | WRITE 
------------------+---------+---------------+--------------+----------+-----------+--------+----------------------------------------------------------------------------------------------------------------------------------------
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-------
 1713139647002000 | f       | 1713139647002 |            0 |        0 | 100719601 |        | /pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1 
| dcdoor34_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
 1713145035278000 | f       | 1713145035278 |            0 |        0 | 100717339 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4                                                                                     
| dcdoor29_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
 1713146514961000 | f       | 1713146514961 |            0 |        0 | 100716133 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.6                                                                                     
| dcdoor25_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -2 | t
 1713142071348000 | f       | 1713142071348 |            0 |        0 | 100715611 |        | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.2                                                                                     
| dcdoor21_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull |    -1 | t
transfermanagers-dccore02=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13701128
(1 row)

transfermanagers-dccore02=# \c transfermanagers-dccore01
You are now connected to database "transfermanagers-dccore01" as user "postgres".
transfermanagers-dccore01=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13710601
(1 row)

transfermanagers-dccore01=# \c transfermanagers-dccore03
You are now connected to database "transfermanagers-dccore03" as user "postgres".
transfermanagers-dccore03=# select count(*) FROM "REMOTETRANSFERS_B";
  count   
----------
 13708792
(1 row)

Does this information is needed by the system to clean this table via dCache?

truncate table "REMOTETRANSFERS_B";

The logic here was - "*_B" tables keep transfer history data. I realize we do not have mechanism of periodic cleanup like we do for srm (srmmanager.persistence.keep-history-period). Likely we need to add this.

You can disable database altogether by setting:

transfermanagers.enable.db=false

And it will work without database just fine.

On restart transfer manager never did any recovery from db, so all this persistency is not really used.

It was used in the past purely for monitoring purposes - to monitor transfer states.

As for why you have this fail - I have no idea at the moment, I think need need to look at pool log, desirably in debug log.

Question - is it 100% failure, has it worked before and when it started to fail and if any changes were done that correlate with fails?

cfgamboa commented 2 months ago

Hi @DmitryLitvintsev Thank you we will truncate the table tomorrow. It is not clear to me if the RTM is indeed affected or communication among parties during TPC process

kofemann commented 2 months ago

@cfgamboa

RemoteTransferManager@dccore02Domain:
    id: 1713145035278000
...
RemoteTransferManager@dccore03Domain:
    ID not found : 1713145035278000

I assume you did \s RemoteTransferManager@* ls external <id>, right?

kofemann commented 2 months ago

BTW, It looks like the databases play no role (shared or not), as the records never read back :)

cfgamboa commented 2 months ago

@cfgamboa

RemoteTransferManager@dccore02Domain:
    id: 1713145035278000
...
RemoteTransferManager@dccore03Domain:
    ID not found : 1713145035278000

I assume you did \s RemoteTransferManager@* ls external <id>, right?

Indeed.

DmitryLitvintsev commented 2 months ago

BTW, It looks like the databases play no role (shared or not), as the records never read back :)

yes.. DB plays no role. This is what I tried to express in my reply to this issue. They may as well just disable it:

transfermanagers.enable.db=false

May be an action item - excise all peristency code out of RTM. We have billing records for monitoring anyway.

cfgamboa commented 2 months ago

I have truncated those tables thank you Dmitry.

kofemann commented 2 months ago

I had a deeper look at the RTM logic. Here are my discoveries

This means, that if we have multiple RTM and they have started transfers at the same point in time, then one of them will be lost and as soon as one of them is complete, the second transfer becomes orphan:

Perf Marker
    Timestamp: 1713527368
    State: Running
    State description: Mover created
    Stripe Index: 0
    Stripe Start Time: 1713527348
    Stripe Last Transferred: 1713527348
    Stripe Transfer Time: 19
    Stripe Bytes Transferred: 0
    Stripe Status: RUNNING
    Total Stripe Count: 1
    RemoteConnections: tcp:127.0.0.1:9000
End
Perf Marker
    Timestamp: 1713527373
    State: Unknown transfer
    State description: Unknown transfer
    Stripe Index: 0
    Total Stripe Count: 1
End
kofemann commented 2 months ago

https://rb.dcache.org/r/14247/

vingar commented 2 months ago

Hi @kofemann

This means, that if we have multiple RTM and they have started transfers at the same point in time, then one of them will be lost and as soon as one of them is complete, the second transfer becomes orphan:

Can it explain the transfers without performance markers ?

kofemann commented 2 months ago

A5 least I was able to simulate such situations by using manual clock.