Closed: cfgamboa closed this issue 2 months ago.
Full HTTP-TPC details from the client side (FTS/gfal2):
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: Hop: https://dcdoor-tape.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: > COPY /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
> User-Agent: fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29
> TE: trailers
> Host: dcdoor-tape.usatlas.bnl.gov
> Source: https://webdav.mwt2.org:2881/atlasdatadisk/rucio/mc16_13TeV/01/b6/AOD.37230856._000004.pool.root.1?copy_mode=pull
> X-Number-Of-Streams: 1
> Secure-Redirection: 1
> ClientInfo: job-id=d9b8ce3e-f8d6-11ee-9956-00163e1051a1;file-id=2017106869;retry=0
> TransferHeaderAuthorization: SECRET1
> Credential: none
> Authorization: SECRET2
> RequireChecksumVerification: false
> TransferMetadata: eyJjb2xsb2NhdGlvbl9oaW50cyI6IHt9fQ==
> SciTag: 132
>
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < HTTP/1.1 202 Accepted
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Date: Fri, 12 Apr 2024 14:16:33 GMT
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-Content-Type-Options: nosniff
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-Frame-Options: SAMEORIGIN
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < X-XSS-Protection: 1; mode=block
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Server: dCache/9.2.6
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Content-Security-Policy: default-src 'none' ; img-src 'self' data: ; style-src 'self' 'unsafe-inline' ; script-src 'self'; font-src 'self'
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Referrer-Policy: strict-origin-when-cross-origin
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Content-Type: text/perf-marker-stream
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: < Transfer-Encoding: chunked
INFO Fri, 12 Apr 2024 10:16:33 -0400; Davix: <
INFO Fri, 12 Apr 2024 10:21:33 -0400; Gfal2: Copy failed with mode 3rd pull: Connection terminated abruptly; Status of TPC request unknown
This is followed by a PROPFIND:
INFO Fri, 12 Apr 2024 10:21:34 -0400; Davix: > PROPFIND /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
...
INFO Fri, 12 Apr 2024 10:21:34 -0400; Davix: < HTTP/1.1 207 Multi-status
and a DELETE:
INFO Fri, 12 Apr 2024 10:21:34 -0400; Davix: > DELETE /pnfs/usatlas.bnl.gov/MCTAPE/mc16_13TeV/AOD/e8557_e8455_a906_r10724_r10726/mc16_13TeV.603012.PhPy8EG_ttbar_madspin_smspecialb_singlelep.merge.AOD.e8557_e8455_a906_r10724_r10726_tid37230856_00/AOD.37230856._000004.pool.root.1 HTTP/1.1
...
INFO Fri, 12 Apr 2024 10:21:34 -0400; Davix: < HTTP/1.1 204 No Content
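As an aside, the TransferMetadata header in the COPY request above is base64-encoded JSON; a quick sketch (plain JDK, the class name is illustrative) shows it carries an empty collocation-hints object:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeTransferMetadata {
    public static void main(String[] args) {
        // Value taken verbatim from the TransferMetadata request header above.
        String b64 = "eyJjb2xsb2NhdGlvbl9oaW50cyI6IHt9fQ==";
        String json = new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
        System.out.println(json); // {"collocation_hints": {}}
    }
}
```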
Hello,
I was able to issue an FTS transfer from CERN to BNL; here is the job: https://fts.usatlas.bnl.gov:8449/fts3/ftsmon/#/job/ae540026-fac8-11ee-8775-00163e1051a5
The transfer fails with the same error.
I was able to capture the information at the door and at the pool handling the transfers.
At the pool managing the TPC request:
14 Apr 2024 21:42:05 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:15 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:25 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:35 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:45 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:42:55 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:05 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:15 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:25 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:35 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:45 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:43:55 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_MODIFY event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] http-outgoing-65053: Close connection
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Connection discarded
14 Apr 2024 21:44:00 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Connection released: [id: 65053][route: {}->http://st-096-hh151c66.cern.ch:8443][total available: 15; route allocated: 0 of 200; total allocated: 85 of 200]
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Adjusting allocation: allocated: 16066304309, file size: 16066304309
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] remove entry 00000BF9909B3C864DD089A84BC286A5D2B1: Transfer failed
14 Apr 2024 21:44:01 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Sending IN_CLOSE_WRITE event for 00000BF9909B3C864DD089A84BC286A5D2B1 and parents
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] cleaning up 00000BF9909B3C864DD089A84BC286A5D2B1
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Transfer failed in post-processing: No such file or directory: 00000BF9909B3C864DD089A84BC286A5D2B1
14 Apr 2024 21:44:03 (dcdoor29_1) [door:WebDAV2-dcdoor31-externalipv6@webdav2-dcdoor31_httpsDomain:AAYWGKlGl9A RemoteTransferManager PoolAcceptFile 00000BF9909B3C864DD089A84BC286A5D2B1] Failed to read file size: java.nio.file.NoSuchFileException: /data/dcache_pool_1/pool/data/00000BF9909B3C864DD089A84BC286A5D2B1
The door shows:
[root@dcdoor31 ~]# cat /var/log/dcache/webdav2-dcdoor31_httpsDomain.access|grep gamboa
level=WARN ts=2024-04-14T21:37:15.213-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=404 response.reason="Not Found" socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=7
level=WARN ts=2024-04-14T21:37:15.232-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=404 response.reason="Not Found" socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=7
level=INFO ts=2024-04-14T21:37:15.256-0400 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa response.code=207 response.reason=Multi-status socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] duration=15
level=WARN ts=2024-04-14T21:44:03.465-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcgftp.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 response.code=202 socket.remote=[2620:0:210:1::41]:59562 user-agent="fts_url_copy/3.12.11 gfal2/2.22.2 neon/0.0.29" user.dn="CN=1713141921,CN=108542548,CN=Carlos Fernando Gamboa,CN=658196,CN=cgamboa,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=9102:31152,69907[6VTqkEtm:7i7eY02m] tpc.credential=none tpc.error="Post-processing failed: No such file or directory: 00000BF9909B3C864DD089A84BC286A5D2B1" tpc.require-checksum=false tpc.source=https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull duration=408198
The DAV door is still interacting with the transfer; it seems the door could not get back to the RTM (RemoteTransferManager):
Failed to deliver TransferStatusQueryMessage message.
Is this perhaps related to https://github.com/dCache/dcache/issues/7519 ?
From the transfer manager:
RemoteTransferManager@dccore02Domain:
id: 1713145035278000
Source: https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull
Destination: /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4
State: Mover created
User: cgamboa
Restriction: CompositeRestriction[PrefixRestrict[prefix=/pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4], Restrict[DOWNLOAD,UPDATE_METADATA]]
PNFS-ID: 00000BF9909B3C864DD089A84BC286A5D2B1
Access latency: ONLINE
Retention policy: REPLICA
Storage class: cgamboa:USERS
Pool: PoolName=dcdoor29_1 PoolAddress=dcdoor29_1@dcdoor29oneDomain
Mover: 100717339
RemoteTransferManager@dccore03Domain:
ID not found : 1713145035278000
The tables reflect the proper state:
ID | CREATED | CREATIONTIME | CREDENTIALID | LIFETIME | MOOVERID | PNFSID | LOCALPATH | REMOTEURL | STATE | WRITE
1713145035278000 | f | 1713145035278 | 0 | 0 | 100717339 | | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | 10 | t
(1 row)
When the transfer fails, the request is moved to the backup table:
ID | CREATED | CREATIONTIME | CREDENTIALID | LIFETIME | MOOVERID | PNFSID | LOCALPATH | POOLNAME | REMOTEURL | STATE | WRITE
1713145035278000 | f | 1713145035278 | 0 | 0 | 100717339 | | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 | dcdoor29_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | -1 | t
(1 row)
Restarting the door seemed to alleviate this.
Carlos
It seems that restarting the doors may have caused an inconsistency between the RemoteTransferManager's entries (\s RemoteTransferManager@* ls external) and the corresponding entries in the "REMOTETRANSFERS" table. Some entries in the database are no longer valid transfers in the RemoteTransferManager.
Hey. You appear to run multiple transfer managers in HA mode; do they share the same database, as recommended? This looks to me like one RTM knows about the transfer and the other does not, maybe due to different states/databases. If the wrong one is asked, it does not know the ID:
RemoteTransferManager@dccore02Domain:
id: 1713145035278000
[...]
RemoteTransferManager@dccore03Domain:
ID not found : 1713145035278000
Hi,
Three transfer managers run, and each of them has a dedicated database. This setup was recommended by Dmitry in [www.dcache.org #10506]:
Please make sure that each RemoteTransferManager instance has its own database associated. (The databases can be hosted by the same host; just create multiple of them:
createdb transfermanagers1
createdb transfermanagers2
and then adjust transfermanagers.db.name accordingly for each RemoteTransferManager cell.)
PS: RemoteTransferManager is missing here: https://www.dcache.org/manuals/Book-9.2/cookbook-ha-with-replicable-services.shtml
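For illustration, the recommendation above might translate into per-core layout stanzas like the following sketch. The domain names are hypothetical; only transfermanagers.db.name is taken from the recommendation, so check the exact property set against the dCache Book before using it.

```
# layout on dccore02 (hypothetical domain name)
[transfermanagersDomain]
[transfermanagersDomain/transfermanagers]
transfermanagers.db.name=transfermanagers-dccore02

# layout on dccore03 (hypothetical domain name)
[transfermanagersDomain]
[transfermanagersDomain/transfermanagers]
transfermanagers.db.name=transfermanagers-dccore03
```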
Not to deviate from the main topic here, but what would be the recommended procedure to maintain the REMOTETRANSFERS_B tables? They seem to hold the history of the REMOTETRANSFERS requests:
ID | CREATED | CREATIONTIME | CREDENTIALID | LIFETIME | MOOVERID | PNFSID | LOCALPATH | POOLNAME | REMOTEURL | STATE | WRITE
1713139647002000 | f | 1713139647002 | 0 | 0 | 100719601 | | /pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1 | dcdoor34_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | -1 | t
1713145035278000 | f | 1713145035278 | 0 | 0 | 100717339 | | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.4 | dcdoor29_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | -1 | t
1713146514961000 | f | 1713146514961 | 0 | 0 | 100716133 | | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.6 | dcdoor25_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | -2 | t
1713142071348000 | f | 1713142071348 | 0 | 0 | 100715611 | | /pnfs/usatlas.bnl.gov/users/cgamboa/test.fts.dbg.2 | dcdoor21_1 | https://eosatlas.cern.ch:443/eos/atlas/atlasdatadisk/rucio/data24_13p6TeV/73/62/data24_13p6TeV.00472662.express_express.merge.AOD.f1437_m2243._lb0201._0001.1?copy_mode=pull | -1 | t
transfermanagers-dccore02=# select count(*) FROM "REMOTETRANSFERS_B";
count
----------
13701128
(1 row)
transfermanagers-dccore02=# \c transfermanagers-dccore01
You are now connected to database "transfermanagers-dccore01" as user "postgres".
transfermanagers-dccore01=# select count(*) FROM "REMOTETRANSFERS_B";
count
----------
13710601
(1 row)
transfermanagers-dccore01=# \c transfermanagers-dccore03
You are now connected to database "transfermanagers-dccore03" as user "postgres".
transfermanagers-dccore03=# select count(*) FROM "REMOTETRANSFERS_B";
count
----------
13708792
(1 row)
Is this information needed by the system? Is there a way to clean this table via dCache?
truncate table "REMOTETRANSFERS_B";
The logic here was: the "*_B" tables keep transfer history data. I realize we do not have a mechanism for periodic cleanup like we do for SRM (srmmanager.persistence.keep-history-period). Likely we need to add this.
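Until such a keep-history mechanism exists, sites can prune the history themselves. A hedged sketch (assuming, as the values above suggest, that CREATIONTIME holds epoch milliseconds; verify the column name and quoting against your schema before running it):

```sql
-- Delete REMOTETRANSFERS_B history rows older than 30 days.
-- CREATIONTIME is assumed to be epoch milliseconds.
DELETE FROM "REMOTETRANSFERS_B"
WHERE "CREATIONTIME" < extract(epoch FROM now() - interval '30 days') * 1000;
```

This could be scheduled via cron against each transfermanagers database.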
You can disable the database altogether by setting:
transfermanagers.enable.db=false
and it will work without a database just fine.
On restart the transfer manager never did any recovery from the db, so all this persistency is not really used. It was used in the past purely for monitoring purposes, to monitor transfer states.
As for why you have this failure, I have no idea at the moment; I think we need to look at the pool log, ideally at debug level.
Question: is it a 100% failure? Has it worked before, when did it start to fail, and were any changes made that correlate with the failures?
Hi @DmitryLitvintsev, thank you, we will truncate the table tomorrow. It is not clear to me whether the RTM itself is affected, or the communication among the parties during the TPC process.
@cfgamboa
RemoteTransferManager@dccore02Domain:
id: 1713145035278000
...
RemoteTransferManager@dccore03Domain:
ID not found : 1713145035278000
I assume you did \s RemoteTransferManager@* ls external <id>, right?
BTW, it looks like the databases play no role (shared or not), as the records are never read back :)
Indeed.
Yes, the DB plays no role. This is what I tried to express in my reply to this issue. They may as well just disable it:
transfermanagers.enable.db=false
Maybe an action item: excise all persistency code out of the RTM. We have billing records for monitoring anyway.
I have truncated those tables, thank you Dmitry.
I had a deeper look at the RTM logic. Here are my discoveries:
public synchronized long next() {
return (_last = Math.max(_last + 1, System.currentTimeMillis() * 1_000));
}
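To make the collision window concrete, here is a minimal, runnable sketch of that generator (a stand-alone re-implementation for illustration, with the clock injectable so two independent "domains" can be driven from the same instant; class names are hypothetical, not dCache's):

```java
import java.util.function.LongSupplier;

class IdGenerator {
    private long last;
    private final LongSupplier clockMillis;

    IdGenerator(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    // Same logic as the quoted next(): wall clock scaled to micros,
    // bumped by one if the clock has not advanced since the last call.
    synchronized long next() {
        return (last = Math.max(last + 1, clockMillis.getAsLong() * 1_000));
    }
}

public class IdCollisionDemo {
    public static void main(String[] args) {
        long frozen = 1_713_145_035_278L; // one fixed wall-clock instant
        IdGenerator core02 = new IdGenerator(() -> frozen);
        IdGenerator core03 = new IdGenerator(() -> frozen);

        // Two independent RTMs, same instant: identical transfer IDs.
        System.out.println(core02.next()); // 1713145035278000
        System.out.println(core03.next()); // 1713145035278000
    }
}
```

Within one instance the `Math.max(last + 1, ...)` keeps IDs strictly increasing, but nothing makes them unique across instances that share only the wall clock.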
WebDAV door keeps track of transfers based on transfer IDs:
private final Map<Long, RemoteTransfer> _transfers = new ConcurrentHashMap<>();
...
{
    _id = response.getId();
    _transfers.put(_id, this);
}
This means that if we have multiple RTMs and they started transfers at the same point in time, they can hand out the same transfer ID; one of the transfers will then be lost, and as soon as one of them completes, the second transfer becomes an orphan:
Perf Marker
Timestamp: 1713527368
State: Running
State description: Mover created
Stripe Index: 0
Stripe Start Time: 1713527348
Stripe Last Transferred: 1713527348
Stripe Transfer Time: 19
Stripe Bytes Transferred: 0
Stripe Status: RUNNING
Total Stripe Count: 1
RemoteConnections: tcp:127.0.0.1:9000
End
Perf Marker
Timestamp: 1713527373
State: Unknown transfer
State description: Unknown transfer
Stripe Index: 0
Total Stripe Count: 1
End
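The orphaning itself is ordinary map semantics; a toy sketch (hypothetical names, not dCache's actual classes) of what happens in the door's _transfers map when two transfers receive the same ID:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrphanDemo {
    public static void main(String[] args) {
        Map<Long, String> transfers = new ConcurrentHashMap<>();
        long id = 1_713_145_035_278_000L; // same ID handed to both transfers

        transfers.put(id, "transfer-A");
        transfers.put(id, "transfer-B"); // overwrites A: A is orphaned now

        transfers.remove(id); // B finishes and removes the shared entry

        // A polls for its status: the map no longer knows the transfer.
        System.out.println(transfers.containsKey(id) ? "Running" : "Unknown transfer");
    }
}
```

This matches the perf-marker sequence above: one poll sees the running transfer, the next poll after completion reports "Unknown transfer".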
Hi @kofemann
This means, that if we have multiple RTM and they have started transfers at the same point in time, then one of them will be lost and as soon as one of them is complete, the second transfer becomes orphan:
Can this explain the transfers without performance markers?
At least I was able to simulate such situations by using a manual clock.
Hello,
We are observing TPC transfers via FTS to the BNL SE failing on write; FTS extract:
At dCache we observed this after enabling the RTM in debug mode.
Billing DB log:
The door shows:
I do not understand the time gap for the RTM events; they are logged 10 minutes later.
See this extract:
Any ideas?
All the best, Carlos