dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

Bulk release requests are not working if using relative path #7635

Open ageorget opened 3 months ago

ageorget commented 3 months ago

Hi,

I found that the release process does not work when the release request uses a relative path (without the prefix), which could explain why our Atlas staging buffer is full most of the time.

To reproduce it, I sent a staging request for this file: /atlasmctape/mc16_13TeV/HITS/e8351_s3126/mc16_13TeV.700337.Sh_2211_Znunu_pTV2_CVetoBVeto.simul.HITS.e8351_s3126_tid30364865_00/HITS.30364865._017868.pool.root.1

cat stageAtlas.json
{
"files": [
{"path": "/atlasmctape/mc16_13TeV/HITS/e8351_s3126/mc16_13TeV.700337.Sh_2211_Znunu_pTV2_CVetoBVeto.simul.HITS.e8351_s3126_tid30364865_00/HITS.30364865._017868.pool.root.1","diskLifetime":"PT1H"}
]
}

curl --capath /etc/grid-security/certificates --cacert $X509_USER_PROXY --cert $X509_USER_PROXY -X POST "https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/stage" -H  "accept: application/json" -H  "content-type: application/json" -d @stageAtlas.json
{
  "requestId" : "1b72f21e-d66a-4af7-a784-6178a3c3a35c"
}

level=INFO ts=2024-08-12T16:07:39.771+0200 event=org.dcache.frontend.request request.method=POST request.url=https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/stage response.code=201 response.reason=Created location=https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/stage/1b72f21e-d66a-4af7-a784-6178a3c3a35c socket.remote=[2001:660:5009:84:134:158:239:7]:35504 user-agent=curl/7.29.0 user.dn="CN=1855496286,CN=GEORGET Adrien adrien.georget.2@cnrs.fr,O=Centre national de la recherche scientifique,C=FR,DC=tcs,DC=terena,DC=org" user.mapped=3327:124 request.entity="{\"files\":[{\"path\"[...]fetime\":\"PT1H\"}]}" response.entity="{\n  \"requestId\" : \"1b72f21e-d66a-4a[...]" duration=15

Staging is OK and the file is pinned on the disk cache:

\s pool-atlas-read-li425a rep sticky ls 000098FBFE5589274CABB284DA5BBB379C4B
self : expires 8/12/24, 4:12 PM
PinManager-0649a68f-2bc8-48e6-8138-c40d0b4bf130 : expires 8/14/24, 4:37 PM

Then I release the file using its relative path:

cat archiveinfo.json
{
"paths": ["/atlasmctape/mc16_13TeV/HITS/e8351_s3126/mc16_13TeV.700337.Sh_2211_Znunu_pTV2_CVetoBVeto.simul.HITS.e8351_s3126_tid30364865_00/HITS.30364865._017868.pool.root.1"]
}

curl --capath /etc/grid-security/certificates --cacert $X509_USER_PROXY --cert $X509_USER_PROXY -X POST "https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/release/1b72f21e-d66a-4af7-a784-6178a3c3a35c" -H  "accept: application/json" -H  "content-type: application/json" -d @archiveinfo.json

level=INFO ts=2024-08-12T16:10:47.568+0200 event=org.dcache.frontend.request request.method=POST request.url=https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/release/1b72f21e-d66a-4af7-a784-6178a3c3a35c response.code=200 response.reason=OK socket.remote=[2001:660:5009:84:134:158:239:7]:35512 user-agent=curl/7.29.0 user.dn="CN=1855496286,CN=GEORGET Adrien adrien.georget.2@cnrs.fr,O=Centre national de la recherche scientifique,C=FR,DC=tcs,DC=terena,DC=org" user.mapped=3327:124 request.entity="{\"paths\":[\"/atlas[...]68.pool.root.1\"]}" duration=11

After 30 minutes, the pin is still active:

\s pool-atlas-read-li425a rep sticky ls 000098FBFE5589274CABB284DA5BBB379C4B
PinManager-0649a68f-2bc8-48e6-8138-c40d0b4bf130 : expires 8/14/24, 4:37 PM
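
A check like the one above can be scripted. The following is a minimal sketch, assuming the `rep sticky ls` output keeps the `owner : expires ...` layout shown in this report; the function name `pinmanager_stickies` is hypothetical and not part of dCache.

```python
from typing import List


def pinmanager_stickies(rep_sticky_ls_output: str) -> List[str]:
    """Return the sticky-flag owners that belong to PinManager.

    Assumes each line looks like '<owner> : expires <date>', as in the
    pool output quoted in this issue. An empty result means no
    PinManager pin is left on the replica.
    """
    owners = []
    for line in rep_sticky_ls_output.splitlines():
        owner, sep, _ = line.partition(" : ")
        owner = owner.strip()
        if sep and owner.startswith("PinManager-"):
            owners.append(owner)
    return owners
```

If the returned list is still non-empty some time after a release request came back with 200, the pin was not removed.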

If I instead release the file using its full path, the file is instantly unpinned from the disk:

cat archiveinfo.json
{
"paths": ["/pnfs/in2p3.fr/data/atlas/atlasmctape/mc16_13TeV/HITS/e8351_s3126/mc16_13TeV.700337.Sh_2211_Znunu_pTV2_CVetoBVeto.simul.HITS.e8351_s3126_tid30364865_00/HITS.30364865._017868.pool.root.1"]
}

[16:17]:curl --capath /etc/grid-security/certificates --cacert $X509_USER_PROXY --cert $X509_USER_PROXY -X POST "https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/release/1b72f21e-d66a-4af7-a784-6178a3c3a35c" -H  "accept: application/json" -H  "content-type: application/json" -d @archiveinfo.json

level=INFO ts=2024-08-12T16:17:19.146+0200 event=org.dcache.frontend.request request.method=POST request.url=https://ccdcamcli08.in2p3.fr:3880/api/v1/tape/release/1b72f21e-d66a-4af7-a784-6178a3c3a35c response.code=200 response.reason=OK socket.remote=[2001:660:5009:84:134:158:239:7]:35522 user-agent=curl/7.29.0 user.dn="CN=1855496286,CN=GEORGET Adrien adrien.georget.2@cnrs.fr,O=Centre national de la recherche scientifique,C=FR,DC=tcs,DC=terena,DC=org" user.mapped=3327:124 request.entity="{\"paths\":[\"/pnfs/[...]68.pool.root.1\"]}" duration=27

In PinManager:

Aug 12 16:17:20 ccdcamcli08 dcache@PinManagerDomain[129736]: 12 Aug 2024 16:17:20 (PinManager) [BackgroundUnpinner-201460] Unpining [955776409] 000098FBFE5589274CABB284DA5BBB379C4B (1b72f21e-d66a-4af7-a784-6178a3c3a35c) by 3327:124 2024-08-12 16:07:39 to 2024-08-14 16:07:45 is READY_TO_UNPIN on pool-atlas-read-li425a:PinManager-0649a68f-2bc8-48e6-8138-c40d0b4bf130

[ccdcamcli06] (bulk@bulkDomain) ageorget > \s pool-atlas-read-li425a rep sticky ls 000098FBFE5589274CABB284DA5BBB379C4B
[ccdcamcli06] (bulk@bulkDomain) ageorget > 
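
Since the release only takes effect with the full path, one possible client-side workaround is to prepend the prefix before building the release body. This is a minimal sketch, not a description of the dCache fix: the helper names `to_full_path` and `release_payload` are hypothetical, and `ROOT_PREFIX` is an assumption taken from the full path used in this report.

```python
import json
import posixpath

# Assumption: the prefix missing from the relative path, as seen in the
# full path in this report. Adjust it for your own site and user root.
ROOT_PREFIX = "/pnfs/in2p3.fr/data/atlas"


def to_full_path(path: str, prefix: str = ROOT_PREFIX) -> str:
    """Prepend prefix to path unless path is already located under it."""
    prefix = prefix.rstrip("/")
    if path == prefix or path.startswith(prefix + "/"):
        return path
    return posixpath.join(prefix, path.lstrip("/"))


def release_payload(paths, prefix: str = ROOT_PREFIX) -> str:
    """Build the JSON body ({"paths": [...]}) for a release request."""
    return json.dumps({"paths": [to_full_path(p, prefix) for p in paths]})
```

The resulting string can be written to archiveinfo.json and posted with the same curl command as above.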

The bulk service also does not report when a release request has not been carried out. Could you please check this?

Adrien

DmitryLitvintsev commented 3 months ago

Likely the same patch I did for staging needs to be applied to release.

DmitryLitvintsev commented 3 months ago

OK. Like last time, I have built an RPM with a patch:

https://drive.google.com/file/d/1mgXibWbUUnqM0WsRclAKh-K8x3awIBkx/view?usp=sharing

Could you deploy it on your frontend door? Before doing so, make sure you have tried it on your test system.

ageorget commented 3 months ago

Thank you, Dmitry, for your quick fix. I just copied the frontend jar from the RPM like last time, and unpinning seems to work now:

Aug 13 10:03:04 ccdcamcli08 dcache@PinManagerDomain[129736]: 13 Aug 2024 10:03:04 (PinManager) [bulk PinManagerUnpin] Unpinned 0000D63CB52D45B0404F9AF60A8F8F8DDDE9 (955788379)
Aug 13 10:03:06 ccdcamcli08 dcache@PinManagerDomain[129736]: 13 Aug 2024 10:03:06 (PinManager) [bulk PinManagerUnpin] Unpinned 0000235E8344B2FA40B1A56C2E5F002231C4 (955788682)
Aug 13 10:03:06 ccdcamcli08 dcache@PinManagerDomain[129736]: 13 Aug 2024 10:03:06 (PinManager) [bulk PinManagerUnpin] Unpinned 0000A9031930F46E4E8D8540F22030C07F62 (955789107)
Aug 13 10:03:06 ccdcamcli08 dcache@PinManagerDomain[129736]: 13 Aug 2024 10:03:06 (PinManager) [bulk PinManagerUnpin] Unpinned 000040FBBED815A44A759CE710C56EF80CA3 (955789383)
Aug 13 10:03:07 ccdcamcli08 dcache@PinManagerDomain[129736]: 13 Aug 2024 10:03:07 (PinManager) [bulk PinManagerUnpin] Unpinned 0000536A5798A7474E109E7D02D2DD9D8683 (955788506)

DmitryLitvintsev commented 3 months ago

Yes. Sorry for all this. This should have been fixed in one go.