Closed cfgamboa closed 6 months ago
Hi @cfgamboa ,
So, at a high level, what's happening here is ATLAS wants to copy a file from T0D1 (i.e., disk) storage to T1D0 (i.e., tape) storage. To do this, they are trying to use HTTP-TPC, with both endpoints being at BNL.
There are two attempts.
The first attempt is a pull-request, so the COPY request is to the T1D0 url, with a Source
header that points to the T0D1 url. This HTTP-TPC request fails with the message rejected GET: 500 Server Error
.
The second attempt is a push request, so the COPY request targets the T0D1 url, with a Destination
header that points to the T1D0 url. This HTTP-TPC request fails with the message rejected PUT: 500 Server Error
.
In both cases, we see a Task was cancelled
message, but from different pools. This message comes from the migration module on a pool, but unfortunately we don't log why this migration is being cancelled.
I would imagine the next step would be to understand why the migration module task is being cancelled.
Hello @paulmillar
Thank you for your feedback. I have updated the FTS transfer URL with the one related to this transfer.
More contextual information is that the day before we had a hardware failure affecting the SRM component. I understand that the DAV doors were restarted as well later on. However, there were many transfers failing like this one afterwards. I had to restart the Dav doors and apparently this helped on alleviating this type of problem.
What is the expected behavior of the RemoteTransferManager per failure occurrence in SRM?
Hi @cfgamboa ,
A couple of quick questions:
Which version of dCache is this?
Is the problem ongoing, or are transfers succeeding again?
SRM and HTTP-TPC are unrelated; however, both the SRM and WebDAV doors use the transfermanagers
service to orchestrate third-party transfers (SRM third-party transfers are no longer used by anyone). I believe sites often run the transfermanagers
service close to the srmmanager
service (e.g., in the same domain), so a hardware failure may have affected both.
The WebDAV door should recover from transfermanagers
service being restarted, but this has been problematic in the past.
I've created a patch that should update the message Task was cancelled
to include the reason why the task was cancelled. Once this patch passes code review and is part of a bug-fix release, you would need to update the pools to benefit from the change.
That said, I think the migration task being cancelled is likely a result of some underlying problem, rather than the cause of the transfers failing.
HTH, Paul.
Hello @paulmillar
Which version of dCache is this? dcache-7.2.7 for doors and SRM, other components are 7.2.3
Is the problem ongoing, or are transfers succeeding again? Yes after restarting the doors the transfers succeeding again.
Indeed, the transfermanagers is part cell resource in the SRM domain.
Thank you for your feedback and work for in the patch.
All the best, Carlos
Hello all,
we have seen many transfers failing with
rejected PUT: 500 Server Error
https://fts304.usatlas.bnl.gov:8449/var/log/fts3/transfers/2022-03-21/dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov/2022-03-21-1708__dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov__1282623060__d7b80cb2-a905-11ec-8f8e-b49691292ed8At the billing log the file appears to be available for copy, here the source is PNSFID 0000EEC369BA76464902A124C6842E7DE7CA, file name:
Looking at the Billinglogs for that transfer
03.21 17:08:42 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 5 0 {10011:"Task was cancelled."}
03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"}
03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"}
03.21 17:08:55 [door:WebDAV2-dcdoor08-2@webdav2-dcdoor08_httpsDomain:request] ["usatlas1":6435:31152:10.42.38.52] [0000F1646266F48845FE8BBB730D97F75E69,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 15 0 {10011:"Task was cancelled."}
03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}
03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}
03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000EEC369BA76464902A124C6842E7DE7CA,0] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 12275 12216 {10027:"rejected PUT: 500 Server Error"}
03.21 18:00:34 [door:WebDAV-dcdoor18-2@webdav-dcdoor18_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 17846 0 {0:""} 03.21 18:00:36 [pool:dcdoor11_1:transfer] [0000B8E98835C401469584C9A75D2D38F865,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 2621829888 17902 true {RemoteHttpsDataTransfer-1.1:https://dcdoor18.usatlas.bnl.gov:2881/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647900017018-168986] {0:""} 03.21 18:00:17 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000B8E98835C401469584C9A75D2D38F865,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 20522 38 {0:""}
[root@dc209 dcache]# ls -l /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA -rw-r--r-- 1 root root 2621829888 Mar 21 17:05 /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA
[root@dcmaint02 03]# grep 0000D1ACF85F8E9148A7875B57F87F66344B billing-2022.03.21 03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"} 03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"} 03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm 0 166 false {Http-1.1:10.42.38.13:0:dc209_11:dc209elevenDomain:/0000D1ACF85F8E9148A7875B57F87F66344B} [pool:dc209_11@dc209elevenDomain] {0:""} 03.21 17:08:42 [pool:dc209_11@dc209elevenDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"Transfer failed and replica is FROM_POOL"} 03.21 17:08:43 [pool:dcdoor11_1@dcdoor11oneDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"cleaner@cleanerDomain [PoolRemoveFiles]"}
root@dcdoor11 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B dcdoor11oneDomain.log 21 Mar 2022 17:08:42 (dcdoor11_1) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA RemoteTransferManager PoolAcceptFile 0000D1ACF85F8E9148A7875B57F87F66344B] Transfer failed: rejected GET: 500 Server Error
[root@dc209 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B * grep: check_pool_logs: Is a directory dc209elevenDomain.log:21 Mar 2022 17:08:43 (dc209_11) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA dcdoor11_1 HttpDoorUrlInfo 0000D1ACF85F8E9148A7875B57F87F66344B] P2P for 0000D1ACF85F8E9148A7875B57F87F66344B failed: CacheException(rc=10001;msg=No such file or directory: 0000D1ACF85F8E9148A7875B57F87F66344B)
[root@dcdoor12 dcache]# zgrep data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data webdav2-dcdoor12_httpsDomain.* webdav2-dcdoor12_httpsDomain.access:level=INFO ts=2022-03-21T17:08:42.617-0400 event=org.dcache.webdav.request request.method=POST request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=200 socket.remote=[2620:0:210:1::1c]:54658 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907 request.macaroon-request="{\"caveats\": [\"activity:LIST,DOWNLOAD\"], \"validity\": \"PT196M\"}" response.macaroon-id=61xQF3bx:yJhMgeIQ duration=0 webdav2-dcdoor12_httpsDomain.access:level=ERROR ts=2022-03-21T17:08:42.714-0400 event=org.dcache.webdav.request request.method=GET request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=500 socket.remote=192.12.15.232:38288 user-agent=dCache/7.2.7 user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] transaction=door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXawOa0RJg:1647896922709000 duration=11 webdav2-dcdoor12_httpsDomain.access:level=WARN ts=2022-03-21T17:08:55.129-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=202 socket.remote=[2620:0:210:1::1c]:54668 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] tpc.credential=none tpc.error="rejected PUT: 500 Server Error" tpc.require-checksum=false tpc.destination=https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data duration=12287