dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

No unpin done after release by bulk service #7516

Open ageorget opened 9 months ago

ageorget commented 9 months ago

Hi,

For this data challenge, LHCb is trying to use the Tape REST API to recall files from tape at IN2P3. I'm currently in a situation where all the LHCb staging pools are full of pinned files, even after FTS has released the files via the API.

On FTS, the release is called after the file is transferred to another endpoint:

INFO    Wed, 21 Feb 2024 15:48:38 +0100; Gfal2: Copy succeeded using mode 3rd pull
INFO    Wed, 21 Feb 2024 15:48:38 +0100; [1708526918783] BOTH http_plugin   TRANSFER:EXIT   https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw ([2001:660:5009:92:134:158:209:208]) => https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Gfal2: Event triggered: BOTH http_plugin TRANSFER:EXIT https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw ([2001:660:5009:92:134:158:209:208]) => https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw
...

INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > POST /api/v1/tape/release/47b1ad66-0ad0-4a06-a3be-bcd8d48722be HTTP/1.1
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > Host: ccdcamcli06.in2p3.fr:3880
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > Accept: */*
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > ClientInfo: job-id=4b75feda-cff5-11ee-b86a-fa163e5a69c7;file-id=88832425;retry=0
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > Content-Type: application/json
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > User-Agent: libdavix/0.8.5 libcurl/7.69.0-DEV
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: > Content-Length: 156
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: 
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: < HTTP/1.1 200 OK
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: < Date: Wed, 21 Feb 2024 14:48:38 GMT
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: < Server: dCache/9.2.6
INFO    Wed, 21 Feb 2024 15:48:38 +0100; Davix: < Content-Length: 0

On the bulk service, the request is COMPLETED:

2024-02-20 14:46:55.656 | 2024-02-20 14:46:55.656 | 2024-02-21 15:43:51.009 | COMPLETED | /pnfs/in2p3.fr/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw

But if I check on the pool, the file is sticky and the pin is set to expire in 2 days:

[ccdcamcli06] (pool-lhcb-read-li437a@ccdcacli437-pool-lhcb-read-li437a-Domain) ageorget > \sn pnfsidof /pnfs/in2p3.fr/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030115/datamover08-m/01edf986-0b8a-4dc4-b14e-4c7d97262636.raw
0000237F820924294893AD6B189AFB194EE6

[ccdcamcli06] (pool-lhcb-read-li437a@ccdcacli437-pool-lhcb-read-li437a-Domain) ageorget > rep ls 0000237F820924294893AD6B189AFB194EE6
0000237F820924294893AD6B189AFB194EE6 <C-------X--L(0)[0]> 5368709120 si={lhcb:lhcb}

[ccdcamcli06] (pool-lhcb-read-li437a@ccdcacli437-pool-lhcb-read-li437a-Domain) ageorget > rep sticky ls 0000237F820924294893AD6B189AFB194EE6
PinManager-0cb22b4c-db0b-44d3-9747-5ba9f8844ae0 : expires 2/23/24, 4:13 PM

No errors in the PinManager logs or the bulk logs.

Any idea what could prevent the pin from being released?

Adrien

lemora commented 9 months ago

Hi Adrien.

Unpinning is done periodically (every 60s by default) by PinManager, in chunks of a configurable size. Maybe the unpin requests are still waiting in the PinManager database to be processed. What is the value of pinmanager.max-unpins-per-run in your setup? You could try increasing it to speed up the process.

Do you have many pins in state READY_TO_UNPIN or FAILED_TO_UNPIN in the PinManager database? The latter will be retried after two hours (configurable). In the admin interface you can run \s PinManager count pins to show how many pins are in which state.
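
If you want to check the database directly, a query along these lines should give the same breakdown (a sketch, assuming the standard PinManager schema where the pins table has a state column):

select state, count(1) from pins group by state order by state;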

lemora commented 9 months ago

A note on the pool pins: when someone manually triggers unpinning and PinManager at some point starts unpinning this pin, it sends a message to the pool so it can update/remove its local pin. Until then, the pool's local copy keeps the lifetime with which the pin was created (or to which it was later updated). The idea is that if no unpin request is issued, the pool can take care of removing its pin copy after it expires, without PinManager having to send additional messages.

ageorget commented 9 months ago

Hi Lea,

I have only a few READY_TO_UNPIN pins in PinManager. We use the default value of 200 for pinmanager.max-unpins-per-run and have never had any issue with it. I can try increasing it to 1k.
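
If I do, I suppose it is just a matter of setting the property for the domain running PinManager and restarting it, something like (a sketch; exact placement depends on our layout file):

pinmanager.max-unpins-per-run=1000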

I just noticed thousands of messages like this in bulk with info -l:

We are waiting for the following messages
<-1887545599502104009:-8599270973741174048> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<1997532960025890261:-6466367291753012051> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<-417089792157071491:-6531966288995633783> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<197064469393197806:-8755869831036019408> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<-2978437343650035021:-8361377862430911445> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<6681677722378325908:-7828387334932881086> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<102253051408368870:-7375787922466375525> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<-1563051502383447299:-8034720557027671906> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<-3510179856851186643:-5476118133343840651> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage
<4864028699198506425:-9156454993026933112> r=9223372036854775 sec;msg=org.dcache.pinmanager.PinManagerPinMessage

The bulk service can contact the PinManager, as I see many Pinned messages (but no Unpinned ones):

Feb 21 17:29:39 ccdcamcli08 dcache@PinManagerDomain[22679]: 21 Feb 2024 17:29:39 (PinManager) [bulk PinManagerPin] Pinned 000013B831E4D1A7411A8181387D1D269691 on pool-lhcb-read-li437a (947746655)
Feb 21 17:29:41 ccdcamcli08 dcache@PinManagerDomain[22679]: 21 Feb 2024 17:29:41 (PinManager) [bulk PinManagerPin] Pinned 0000415056EAB4CA44FFAF4970649DEB73DD on pool-lhcb-read-li391a (947644838)

ageorget commented 9 months ago

And in the pins table, I only have PINNED entries. Nothing in READY_TO_UNPIN or FAILED_TO_UNPIN:

947734135 | 00001D7EF69E93F34D869DFD311C5C128DAF | 6879481f-503b-43c3-8f2c-af239ad1325f | 2024-02-21 16:15:55.967+01 | 2024-02-23 16:59:09.435+01 | 3437 | 155 | PINNED | pool-lhcb-read-li466a | PinManager-0c101b2e-a216-4e6f-894a-1126874aef42
 947733378 | 0000805EB39A67CA4CAF9220C6994B2421D0 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.975+01 | 2024-02-23 17:03:54.672+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-8d6bb268-cc00-4186-90e9-827253f9f282
 947733376 | 00004EFBA99B4326456DA35FCA0A525AA81C | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.973+01 | 2024-02-23 17:02:59.794+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-d70c1e0d-22a4-44a0-acd4-3471d0af5adf
 947733371 | 000038299EC942B34A90B6283505659A4736 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.971+01 | 2024-02-23 16:54:19.868+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-cf9d672a-b7ce-4277-90d8-fc7067a2829a
 947733354 | 000092F4D17B86964D3AAABB052C940BA03C | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.958+01 | 2024-02-23 16:48:37.165+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-eab3a1c4-d8f4-4ab6-bdf4-8638b3953cf9
 947733336 | 00006D29674723864F8894136E30D9047B38 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.939+01 | 2024-02-23 16:35:22.599+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-e0bd63d1-683e-41ad-9f76-40cd44e8e367
 947733329 | 000064AFD3541D1144CEBCA617752C89728A | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.932+01 | 2024-02-23 16:32:42.411+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-6b3543b9-5ccc-41c4-bd60-d3f48bdbc4f0
 947733314 | 0000E383A22C10A24864843D065CBC7329BC | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.921+01 | 2024-02-23 16:59:05.936+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-0b5aac7e-b238-4323-8acc-31ebe20f442e
 947733293 | 0000506F5ED69787444A9F5F8D9ECBF38202 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.901+01 | 2024-02-23 16:26:37.517+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-c806f5e7-5c64-4168-94f3-0632e33500fd
 947733292 | 0000DDA012C5D6B3468DB1057461E27DF320 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.899+01 | 2024-02-23 16:35:21.028+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-4433c315-0d2b-434e-a6d0-f853ac369b42
 947733287 | 0000C9983E50D94C4005BCF259BA78059972 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.895+01 | 2024-02-23 16:34:35.079+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-fd9676f0-5888-4d8c-9a3d-543107bed2af
 947733276 | 000031F4B05104874E3382306FB71ADB5520 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.883+01 | 2024-02-23 16:21:35.95+01  | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-7bce3b93-4156-40ef-ad30-a85a99a4eb71
 947733272 | 00009850101AA9C94324A187CC0F1BE6ECD5 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.878+01 | 2024-02-23 15:47:17.285+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-b1bb3513-95d4-4b05-826f-4fe0aa4f9dee
 947733261 | 000092143AAD4C1B457588D3874501E08BCA | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.874+01 | 2024-02-23 16:44:14.121+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-b8dac9fb-4829-46e4-a3f0-4c53489920b5
 947733254 | 0000D510FCEBF03647D495FBFE9B871904E7 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.871+01 | 2024-02-23 16:45:08.633+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-07c61adf-f940-41d5-83c7-9d17cd74a5d0
 947733255 | 0000C08403DDB52E40D9826A48BB3E1397DD | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.871+01 | 2024-02-23 15:52:19.504+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-5e9d8c5b-024e-425e-b399-441416005cc6
 947733246 | 00005CEFDE03F48F434AA1EB7AB71BBBACF0 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.867+01 | 2024-02-23 16:19:16.458+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-0f1ec569-fcd1-4d4b-8449-21a68eed204d
 947733244 | 00005C39F6CA36654F72B7091F8E38031BB9 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.866+01 | 2024-02-23 16:35:34.626+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-0b29cbe2-df89-4ad0-9ccd-e969bca065e9
 947733243 | 0000EC1132B0F23B4CFEA2DF66E963A04F75 | 190a3c4e-1a42-4313-802e-a76e690070a8 | 2024-02-21 15:24:55.865+01 | 2024-02-23 16:29:18.833+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-bb0fccc4-596f-40bb-b4e7-dc00d8a0ee1c
 947731443 | 000050D0F3CD81CA45EE97CA6CECE60B19C1 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.697+01 | 2024-02-23 15:09:48.476+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-3226990c-2840-407d-bbc8-1d3e600f7c62
 947731436 | 00003BA3E952D552415181D8E3D841832E65 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.696+01 | 2024-02-23 15:09:06.212+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-4d6d8128-5d94-45f2-82fb-9e4b138f29af
 947731434 | 00003D89FF601858410BAC5D6C884F3C5163 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.693+01 | 2024-02-23 15:52:21.757+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-ffbff9f5-33a5-4232-a290-a3563ea27e07
 947731429 | 00004015E8C7B67E484EACFE7D7CA4A5A322 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.692+01 | 2024-02-23 15:03:21.76+01  | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-76c4bb19-31b2-473c-9fe9-d899208b46b1
 947731430 | 0000E2CAC9DC13E5446B969AD88DDCD7E000 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.692+01 | 2024-02-23 15:32:37.596+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-97c76748-e8bf-46b1-9ffb-b5af2e49324d
 947731427 | 000008F0B6A88A83419EA9BB3B3659C0D143 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.691+01 | 2024-02-23 15:51:03.805+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-7790cb5b-ab65-4e76-ab17-cd6eb66d946e
 947731424 | 000029BBADE71D104D00A7ED59B816E5DCF2 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.69+01  | 2024-02-23 15:08:44.864+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-0cb6e521-cfe1-4fb1-a183-92ad3d2a58d4
 947731416 | 0000DFDDB1EA1210467B85278CEA0486CB1D | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.686+01 | 2024-02-23 15:37:12.545+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-a6259a41-5d2f-4c23-ab8c-339c47efaec1
 947731417 | 0000406B115A0869419580E04569A87EF978 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.686+01 | 2024-02-23 15:11:04.494+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-cba8e49a-cd76-4ab0-93dc-c294ddb16ac6
 947731408 | 000043FF653C9F884F40BE41EF86123DC2D0 | be8de29f-c618-4255-b219-f9ff44e80214 | 2024-02-21 13:49:49.683+01 | 2024-02-23 15:06:34.923+01 | 3437 | 155 | PINNED | pool-lhcb-read-li391a | PinManager-aec581a6-3e58-40fe-8e57-da2d3c3dbc97

ageorget commented 9 months ago

In fact, all RELEASE requests are in QUEUED status in the bulk database, and the number never decreases:

 select status, count(1) from bulk_request where activity like 'RELEASE' group by 1 ;
 status | count 
--------+-------
 QUEUED |  7396
 7729 | 3a32111b-a364-448b-8d80-7a65629458b0 | 3437:155 | 2024-02-21 02:09:45.4   |            | 2024-02-21 02:09:45.419 | QUEUED | RELEASE  | NONE               | https://ccdcamcli06.in2p3.fr:3880/api/v1/tape/release/091cb274-c523-4258-97e3-3a91f2cf9177 | /             | t    
            | t                | t                 | 
  8676 | a646e377-071a-4ea2-92e1-e9a3eb994d0c | 3437:155 | 2024-02-21 15:02:15.561 |            | 2024-02-21 15:02:15.567 | QUEUED | RELEASE  | NONE               | https://ccdcamcli06.in2p3.fr:3880/api/v1/tape/release/c123ff88-8096-4210-be90-c60ed378a781 | /             | t    
            | t                | t                 | 
  9982 | 73e5c26e-63d1-456a-8d41-a16511fb90e5 | 3437:155 | 2024-02-21 17:05:29.218 |            | 2024-02-21 17:05:29.26  | QUEUED | RELEASE  | NONE               | https://ccdcamcli06.in2p3.fr:3880/api/v1/tape/release/ae10d22b-0506-4ffa-8541-a4cc17dd2aaf | /             | t    
            | t                | t                 | 
  7768 | 8d21e1af-3a92-46d9-9e61-8b24efbc375b | 3437:155 | 2024-02-21 13:38:20.697 |            | 2024-02-21 13:38:20.709 | QUEUED | RELEASE  | NONE               | https://ccdcamcli06.in2p3.fr:3880/api/v1/tape/release/01d66ec7-ff17-4af8-8e53-e82d2cb4cd05 | /             | t    
            | t                | t                 | 
  7776 | 08db656d-cb03-4b12-bed1-c4aa8dba3f9f | 3437:155 | 2024-02-21 13:39:03.052 |            | 2024-02-21 13:39:03.059 | QUEUED | RELEASE  | NONE               | https://ccdcamcli06.in2p3.fr:3880/api/v1/tape/release/c90ff633-478f-48c3-b06d-ad42ccc3bcb1 | /             | t    
            | t                | t                 | 
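
For a broader view, the same table could also be broken down by activity and status, using the same columns as in the query above:

 select activity, status, count(1) from bulk_request group by 1, 2 order by 1, 2;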

lemora commented 9 months ago

It does not look like the problem has to do with PinManager, so you should probably leave its config as-is.

How many requests in bulk are in status STARTED? Do they decrease? Could you please enable trace logging in bulk, at least for a while?

ageorget commented 9 months ago

The staging works; the problem started last night when the default value of max-requests-per-user was reached and new requests failed (even release requests). So I increased it from 5k (definitely too low for WLCG) to 20k and unpinned all sticky files already copied from the LHCb buffer. I also increased maxConcurrentRequests from 100 to 1000.

[ccdcamcli06] (bulk@bulkDomain) ageorget > status counts
---------- REQUESTS ----------
     STATUS |           COUNT
  COMPLETED |             170
    STARTED |             215

---------- TARGETS -----------
      STATE |           COUNT
  COMPLETED |           18364
     FAILED |           24611
    RUNNING |           21178
[ccdcamcli06] (bulk@bulkDomain) ageorget > status counts
---------- REQUESTS ----------
     STATUS |           COUNT
  COMPLETED |             170
    STARTED |             234

---------- TARGETS -----------
      STATE |           COUNT
  COMPLETED |           18664
     FAILED |           24611
    RUNNING |           21533

These are the logs for the info -l and info -a commands on the bulk service, with all these strange waiting messages: bulkinfoa.admin.gz bulkinfol.admin.gz

The bulk debug log: bulk.log.gz

lemora commented 9 months ago

Thank you. This is probably unrelated, but does the request archiver component of bulk ever run for you?

Could you please send your bulk configuration? I am interested in the timeouts, especially the one for messages to PinManager. If you want, we can move this conversation into an RT ticket.

ageorget commented 9 months ago

Hi Lea,

As this was our first "high load production" test with the bulk API, we started with the default values for bulk. Then, during the DC, we increased these 2 values because a lot of release requests began to fail with HTTP/1.1 429 Too Many Requests:

bulk.limits.container-processing-threads=1000
bulk.limits.max-requests-per-user=50000

LHCb finished their staging campaign, so maybe this unpin issue was a consequence of limits that were too low, which overloaded the queue.

How can I check that the archiver is working fine? I still have many completed requests with request ls:

request ls
ID           | ARRIVED             |            MODIFIED |        OWNER |     STATUS | UID
49           | 2024/02/19-16:31:10 | 2024/02/19-16:38:52 |     3437:155 |  COMPLETED | 2a927b2a-0705-482a-9eb3-5d8415bf2c41
51           | 2024/02/20-09:36:08 | 2024/02/20-12:16:15 |     3437:155 |  COMPLETED | b036d10b-3f98-4a6a-8fd1-486ed7e4b19b
52           | 2024/02/20-09:37:09 | 2024/02/20-12:23:18 |     3437:155 |  COMPLETED | bd9b28fc-98f6-4865-b75e-7992cf5c5288
53           | 2024/02/20-09:37:11 | 2024/02/20-12:09:18 |     3437:155 |  COMPLETED | 410a599d-a541-4d71-a964-bc2e9665e265
...
35072        | 2024/02/22-23:48:14 | 2024/02/22-23:48:14 |     3437:155 |  COMPLETED | 516ccc6b-a326-43c9-a8be-eafb353ea39a
35396        | 2024/02/23-00:31:10 | 2024/02/23-00:31:19 |     3437:155 |  COMPLETED | 752dde8f-5ace-46c6-bec6-79fdfd94bc89
35397        | 2024/02/23-00:31:11 | 2024/02/23-00:31:11 |     3437:155 |  COMPLETED | 93e29f62-5d50-432f-bbdb-5ae4aba81fa5
35413        | 2024/02/23-00:33:07 | 2024/02/23-00:33:08 |     3437:155 |  COMPLETED | 59c9c1ad-7491-401b-b5de-3f1e3f368924

and with request archived ls:

request archived ls
2ead95d9-b5a4-46b8-b2ae-d31c63b5ace7     |    3437:155 | Tue Feb 20 09:38:15 CET 2024 |      RELEASE | COMPLETED 
b87ce4f7-a9d3-4aeb-b1e6-9b01ac649924     |    3437:155 | Tue Feb 20 09:45:55 CET 2024 |      RELEASE | COMPLETED 
5f3ce15f-ab37-47a9-ba42-c5c00d114312     |    3437:155 | Tue Feb 20 09:47:13 CET 2024 |      RELEASE | COMPLETED 
d0b8807d-5e92-4baf-b948-6a5773d8da5a     |    3437:155 | Tue Feb 20 09:47:29 CET 2024 |      RELEASE | COMPLETED 
14e18bf7-89ff-4c80-8b72-c8ac72e88fb8     |    3437:155 | Tue Feb 20 09:47:30 CET 2024 |      RELEASE | COMPLETED 
ffa2145e-ca1e-4dcc-a983-352dc33330f7     |    3437:155 | Tue Feb 20 09:47:30 CET 2024 |      RELEASE | COMPLETED 
ec80d54e-3dab-4806-96aa-8def2fa878f7     |    3437:155 | Tue Feb 20 09:47:42 CET 2024 |      RELEASE | COMPLETED 
...
b8c904cb-6717-4119-bee9-5ba2522101e2     |    3437:155 | Wed Feb 21 17:47:16 CET 2024 |      RELEASE | COMPLETED 
e6ec4e92-12e2-4204-b78c-f7f94a6f39b4     |    3437:155 | Wed Feb 21 17:47:16 CET 2024 |      RELEASE | COMPLETED 
9b25efa1-a59e-49fb-ac46-869fd25cd5bf     |    3437:155 | Wed Feb 21 17:47:16 CET 2024 |      RELEASE | COMPLETED 
c5504714-b680-42bf-8036-2e48e8fac24d     |    3437:155 | Wed Feb 21 17:47:16 CET 2024 |      RELEASE | COMPLETED 
f8ab20ff-d3cb-4236-bd22-a6741411fd51     |    3437:155 | Wed Feb 21 17:47:16 CET 2024 |      RELEASE | COMPLETED 

lemora commented 8 months ago

Hi Adrien.

The info -a and info -l commands show that the archiver has not run since bulk was last started, but maybe it had not been up for long at that point. Is it still at January 1, 1970? Since you have archived requests, though, this is probably fine.

The situation is tricky, we don't have much to go on right now -- the logs don't report anything out of order as far as I can see.

When you set the new parameters in bulk, you restarted the service, I assume? And then, if I understand you correctly, you unpinned all sticky files that had already been copied away from your system. A new bulk release request that arrived later on was then reported as successful, even though the files were not unpinned.

Did you check whether none of the targeted files were unpinned, or whether some of them were? Did you see Access denied error messages in the PinManager logs, or an Unpinned <pnfsid> <pin id> message corresponding to any of these targeted pins? Are some unpin requests in bulk in a failed state that you could retry?
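
If it helps, a quick way to scan for those on the PinManager host would be something like this (a sketch, assuming the domain logs to the journal as in the lines you pasted earlier; adjust if your logs go elsewhere):

journalctl -u dcache@PinManagerDomain | grep -E 'Unpinned|Access denied'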

The bulk cell is waiting for answers to messages sent to PinManager, but these might not be the ones that were never unpinned; they could just be regular requests in progress. Could you check whether the pnfsids correspond to the FTS request?

I am still looking through the code, but it seems like either messages get lost, or unpin failed/timeout replies from PinManager are not addressed properly by bulk (setting the request to COMPLETE instead of FAILED). We would probably see those in the log, though...

ageorget commented 8 months ago

Hi Lea,

I agree this is tricky; maybe there is no problem with the bulk service and it just needed some parameter tuning to match the DataChallenge load.

> The info -a and info -l commands show that the archiver has not run since bulk was last started, but maybe it had not been up for long at that point. Is it still at January 1, 1970? Since you have archived requests, though, this is probably fine.

The archiver is running fine, I think, and yes, the last run time has changed:

--- request-archiver ---
Archiver Period:          :    6 HOURS
Archiver Window:          :    14 DAYS
Last ran at:              :    Fri Mar 01 10:43:02 CET 2024

> When you set the new parameters in bulk, you restarted the service, I assume? And then, if I understand you correctly, you unpinned all sticky files that had already been copied away from your system.

Yes, the bulk service was restarted to apply the changes. To unstick the situation (full tape buffer pools), I got the list of sticky files on the buffer pools with rep ls -l=s and then unpinned them from the PinManager with unpin * <pnfsid>. The files had been pinned for more than 12 hours, so I assumed FTS had had enough time to copy them from the buffer...
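
In admin-shell terms, the sequence was roughly the following (a sketch; pool name as in the earlier example, and one unpin per pnfsid):

\s pool-lhcb-read-li437a rep ls -l=s
\s PinManager unpin * <pnfsid>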

> Did you check whether none of the targeted files were unpinned, or whether some of them were? Did you see Access denied error messages in the PinManager logs, or an Unpinned <pnfsid> <pin id> message corresponding to any of these targeted pins? Are some unpin requests in bulk in a failed state that you could retry?

There was no error in the PinManager logs; some files were unpinned at first, but then it was as if the communication between the bulk service and the PinManager had stopped.

If I try to manually stage and release a file from the API, everything looks fine now...