dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
291 stars 136 forks source link

transfers failing with rejected PUT: 500 Server Error #6557

Closed cfgamboa closed 6 months ago

cfgamboa commented 2 years ago

Hello all,

we have seen many transfers failing with rejected PUT: 500 Server Error https://fts304.usatlas.bnl.gov:8449/var/log/fts3/transfers/2022-03-21/dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov/2022-03-21-1708__dcsrm.usatlas.bnl.gov__dcgftp.usatlas.bnl.gov__1282623060__d7b80cb2-a905-11ec-8f8e-b49691292ed8

At the billing log the file appears to be available for copy, here the source is PNSFID 0000EEC369BA76464902A124C6842E7DE7CA, file name:

/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data

Looking at the Billinglogs for that transfer

  1. Not working transfer
    
    grep data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data billing-2022.03.21

03.21 17:08:42 [door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 5 0 {10011:"Task was cancelled."}

03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"}

03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"}

03.21 17:08:55 [door:WebDAV2-dcdoor08-2@webdav2-dcdoor08_httpsDomain:request] ["usatlas1":6435:31152:10.42.38.52] [0000F1646266F48845FE8BBB730D97F75E69,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 15 0 {10011:"Task was cancelled."}

03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}

03.21 17:08:55 [pool:dcdoor08_1:transfer] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 0 0 false {RemoteHttpsDataTransfer-1.1:https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922851-162979] {10027:"rejected PUT: 500 Server Error"}

03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000EEC369BA76464902A124C6842E7DE7CA,0] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 12275 12216 {10027:"rejected PUT: 500 Server Error"}


2. working transfer
Similar request appears to succeed with

03.21 18:00:34 [door:WebDAV-dcdoor18-2@webdav-dcdoor18_httpsDomain:request] ["usatlas1":6435:31152:192.12.15.232] [0000EEC369BA76464902A124C6842E7DE7CA,2621829888] [/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt1d0:BNLT1D0@osm 17846 0 {0:""} 03.21 18:00:36 [pool:dcdoor11_1:transfer] [0000B8E98835C401469584C9A75D2D38F865,2621829888] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 2621829888 17902 true {RemoteHttpsDataTransfer-1.1:https://dcdoor18.usatlas.bnl.gov:2881/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647900017018-168986] {0:""} 03.21 18:00:17 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000B8E98835C401469584C9A75D2D38F865,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 20522 38 {0:""}


If we take a look at part 1. There is an error logged as {10011:"Task was cancelled."} 
 Looking at the time when the file is made available in. the pool

[root@dc209 dcache]# ls -l /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA -rw-r--r-- 1 root root 2621829888 Mar 21 17:05 /data3/dcache_pool_38/pool/data/0000EEC369BA76464902A124C6842E7DE7CA


The PNFSID of the file tha failed to be copy in 1. is `0000D1ACF85F8E9148A7875B57F87F66344B`

Looking at the billing logs for this pnfsid I see this:

[root@dcmaint02 03]# grep 0000D1ACF85F8E9148A7875B57F87F66344B billing-2022.03.21 03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://dcdoor12.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data} [door:RemoteTransferManager@srm-dcsrm03Domain:1647896922644-162976] {10027:"rejected GET: 500 Server Error"} 03.21 17:08:42 [door:RemoteTransferManager@srm-dcsrm03Domain:request] ["usatlas1":6435:31152:unknown] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data] bnlt0d1:BNLT0D1@osm 82 40 {10027:"rejected GET: 500 Server Error"} 03.21 17:08:42 [pool:dcdoor11_1:transfer] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm 0 166 false {Http-1.1:10.42.38.13:0:dc209_11:dc209elevenDomain:/0000D1ACF85F8E9148A7875B57F87F66344B} [pool:dc209_11@dc209elevenDomain] {0:""} 03.21 17:08:42 [pool:dc209_11@dc209elevenDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"Transfer failed and replica is FROM_POOL"} 03.21 17:08:43 [pool:dcdoor11_1@dcdoor11oneDomain:remove] [0000D1ACF85F8E9148A7875B57F87F66344B,0] [Unknown] bnlt0d1:BNLT0D1@osm {0:"cleaner@cleanerDomain [PoolRemoveFiles]"}

I looked at the logs in the, pool: dcdoor11_1

root@dcdoor11 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B dcdoor11oneDomain.log 21 Mar 2022 17:08:42 (dcdoor11_1) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA RemoteTransferManager PoolAcceptFile 0000D1ACF85F8E9148A7875B57F87F66344B] Transfer failed: rejected GET: 500 Server Error


And the ones in the pool dc209_11

[root@dc209 dcache]# grep 0000D1ACF85F8E9148A7875B57F87F66344B * grep: check_pool_logs: Is a directory dc209elevenDomain.log:21 Mar 2022 17:08:43 (dc209_11) [door:WebDAV2-dcdoor02-2@webdav2-dcdoor02_httpsDomain:AAXawOazNxA dcdoor11_1 HttpDoorUrlInfo 0000D1ACF85F8E9148A7875B57F87F66344B] P2P for 0000D1ACF85F8E9148A7875B57F87F66344B failed: CacheException(rc=10001;msg=No such file or directory: 0000D1ACF85F8E9148A7875B57F87F66344B)


It seems that since the file failed to be copied to the pool dcdoor11_1 the P2P copy failed.

We are rather puzzled about therefore we would like to know if you could please take a look at this ticket and provide advice as possible. Is there any other information from components that could be use I we are missing?

Below please also find the log for the door involve in the first failure attempt.

All the best,
Carlos

The log for the door involved in the failing transfer is

[root@dcdoor12 dcache]# zgrep data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data webdav2-dcdoor12_httpsDomain.* webdav2-dcdoor12_httpsDomain.access:level=INFO ts=2022-03-21T17:08:42.617-0400 event=org.dcache.webdav.request request.method=POST request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=200 socket.remote=[2620:0:210:1::1c]:54658 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907 request.macaroon-request="{\"caveats\": [\"activity:LIST,DOWNLOAD\"], \"validity\": \"PT196M\"}" response.macaroon-id=61xQF3bx:yJhMgeIQ duration=0 webdav2-dcdoor12_httpsDomain.access:level=ERROR ts=2022-03-21T17:08:42.714-0400 event=org.dcache.webdav.request request.method=GET request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=500 socket.remote=192.12.15.232:38288 user-agent=dCache/7.2.7 user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] transaction=door:WebDAV2-dcdoor12-2@webdav2-dcdoor12_httpsDomain:AAXawOa0RJg:1647896922709000 duration=11 webdav2-dcdoor12_httpsDomain.access:level=WARN ts=2022-03-21T17:08:55.129-0400 event=org.dcache.webdav.request request.method=COPY request.url=https://dcdoor12.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT1D0/data16_13TeV/RAW/other/data16_13TeV.00300800.physics_Main.daq.RAW/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data response.code=202 socket.remote=[2620:0:210:1::1c]:54668 user-agent="fts_url_copy/3.11.1 gfal2/2.20.2 neon/0.0.29" user.dn="CN=1647725402,CN=1546877136,CN=1399668574,CN=Robot: ATLAS Data Management,CN=531497,CN=ddmadmin,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=6435:31152,69907[61xQF3bx:yJhMgeIQ] tpc.credential=none tpc.error="rejected PUT: 500 Server Error" tpc.require-checksum=false tpc.destination=https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data16_13TeV/4e/33/data16_13TeV.00300800.physics_Main.daq.RAW._lb0419._SFO-2._0002.data duration=12287

paulmillar commented 2 years ago

Hi @cfgamboa ,

So, at a high level, what's happening here is ATLAS wants to copy a file from T0D1 (i.e., disk) storage to T1D0 (i.e., tape) storage. To do this, they are trying to use HTTP-TPC, with both endpoints being at BNL.

There are two attempts.

The first attempt is a pull-request, so the COPY request is to the T1D0 url, with a Source header that points to the T0D1 url. This HTTP-TPC request fails with the message rejected GET: 500 Server Error.

The second attempt is a push request, so the COPY request targets the T0D1 url, with a Destination header that points to the T1D0 url. This HTTP-TPC request fails with the message rejected PUT: 500 Server Error.

In both cases, we see a Task was cancelled message, but from different pools. This message comes from the migration module on a pool, but unfortunately we don't log why this migration is being cancelled.

I would imagine the next step would be to understand why the migration module task is being cancelled.

cfgamboa commented 2 years ago

Hello @paulmillar

Thank you for your feedback. I have updated the FTS transfer URL with the one related to this transfer.

More contextual information is that the day before we had a hardware failure affecting the SRM component. I understand that the DAV doors were restarted as well later on. However, there were many transfers failing like this one afterwards. I had to restart the Dav doors and apparently this helped on alleviating this type of problem.

What is the expected behavior of the RemoteTransferManager per failure occurrence in SRM?

paulmillar commented 2 years ago

Hi @cfgamboa ,

A couple of quick questions:

Which version of dCache is this?

Is the problem ongoing, or are transfers succeeding again?

SRM and HTTP-TPC are unrelated; however, both the SRM and WebDAV doors use the transfermanagers service to orchestrate third-party transfers (SRM third-party transfers are no longer used by anyone). I believe sites often run the transfermanagers service close to the srmmanager service (e.g., in the same domain), so a hardware failure may have affected both.

The WebDAV door should recover from transfermanagers service being restarted, but this has been problematic in the past.

I've created a patch that should update the message Task was cancelled to include the reason why the task was cancelled. Once this patch passes code review and is part of a bug-fix release, you would need to update the pools to benefit from the change.

That said, I think the migration task being cancelled is likely a result of some underlying problem, rather than the cause of the transfers failing.

HTH, Paul.

cfgamboa commented 2 years ago

Hello @paulmillar

Which version of dCache is this? dcache-7.2.7 for doors and SRM, other components are 7.2.3

Is the problem ongoing, or are transfers succeeding again? Yes after restarting the doors the transfers succeeding again.

Indeed, the transfermanagers is part cell resource in the SRM domain.

Thank you for your feedback and work for in the patch.

All the best, Carlos