dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
291 stars 136 forks source link

Pools stop flushing to tape with "Negative number of active requests" #7511

Open onnozweers opened 9 months ago

onnozweers commented 9 months ago

Dear dCache devs,

We sometimes have tape write pools that stop flushing to tape after heavy activity, usually after they have been completely filled up with precious files. After a restart of the pool, they continue writing to tape again. Using 9.2.10 but I've seen this since a few years. Now that the WLCG DC24 is ongoing, we see this problem popping up repeatedly.

When it happens, the last line in the pool's log is usually something like this:

14 Feb 2024 22:21:42 (badger2_lhcbtape) [] Flush of 0000F727C00F2D234DF0B5B70069DEE9F0F2 failed with: java.lang.IllegalArgumentException: Negative number of active requests.

Any idea what the problem could be?

I'm afraid I don't have a lot more information than this; not sure what to look for, and probably hard to reproduce because load related. The pool's info output does not show anything that helps me to understand what's wrong.

Cheers, Onno

ageorget commented 9 months ago

Hi Onno,

At IN2P3, we are also affected by the same problem with LHCb since the beginning of the Data Challenge. Some migration pools stop flushing files to tape and stay stuck until a pool restart. But I don't think the flush became stuck because the pool is full of precious files. The pool became full of precious file because the flush stays stuck and files continue to be transferred on the pool. This is still related with the store counter not being always updated when a transfer is cancelled.

I opened an issue 2 years ago about this and @kofemann fixed that but I'm afraid the problem is back https://github.com/dCache/dcache/issues/6426

Adrien

onnozweers commented 9 months ago

Thanks Adrien! I had seen your issue, but saw that it was closed as "fixed", so I assumed this would be another issue.

kofemann commented 9 months ago

Déjà vu...

For now, we have no idea if it is the same issue as #6426 or a different one. Do you have any additional information, like file deletion in parallel, cancelled flushes or so?

onnozweers commented 9 months ago

OK, I'll collect some more information.

Here is a full log of one such pool, from restart until the next time flushing got stuck.

14 Feb 2024 17:04:31 (badger2_lhcbtape) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
14 Feb 2024 17:04:31 (badger2_lhcbtape) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
14 Feb 2024 17:04:31 (badger2_lhcbtape) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/space/lhcb/tape/pool/data;meta=/space/lhcb/tape/pool/meta]]].
14 Feb 2024 17:04:32 (badger2_lhcbtape) [] Pool mode changed to enabled
14 Feb 2024 17:25:12 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWfFRiPA RemoteTransferManager PoolAcceptFile 00004AA3EF42A13648F3AF573BC5EA75986E] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:25:12 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWfFRiPA RemoteTransferManager PoolAcceptFile 00004AA3EF42A13648F3AF573BC5EA75986E] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:25:12 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWfFRiPA RemoteTransferManager PoolAcceptFile 00004AA3EF42A13648F3AF573BC5EA75986E] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00004AA3EF42A13648F3AF573BC5EA75986E
14 Feb 2024 17:28:13 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWfxDrJg RemoteTransferManager PoolAcceptFile 0000E2D275B036D2490CB21678E0A3EA2B06] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:28:13 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWfxDrJg RemoteTransferManager PoolAcceptFile 0000E2D275B036D2490CB21678E0A3EA2B06] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:28:13 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWfxDrJg RemoteTransferManager PoolAcceptFile 0000E2D275B036D2490CB21678E0A3EA2B06] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000E2D275B036D2490CB21678E0A3EA2B06
14 Feb 2024 17:28:26 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWf2NKEA RemoteTransferManager PoolAcceptFile 00001F93157F668B4E32A04B4FBC7F99F8B7] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:28:26 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWf2NKEA RemoteTransferManager PoolAcceptFile 00001F93157F668B4E32A04B4FBC7F99F8B7] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:28:26 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWf2NKEA RemoteTransferManager PoolAcceptFile 00001F93157F668B4E32A04B4FBC7F99F8B7] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00001F93157F668B4E32A04B4FBC7F99F8B7
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWgIFYiA RemoteTransferManager PoolAcceptFile 00007BF763650E9E4C8590B5A92C4B56076B] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWgICYJg RemoteTransferManager PoolAcceptFile 00001E25A89A55BF477BB2AC1ABD2C1D5AF5] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWgICYJg RemoteTransferManager PoolAcceptFile 00001E25A89A55BF477BB2AC1ABD2C1D5AF5] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWgIFYiA RemoteTransferManager PoolAcceptFile 00007BF763650E9E4C8590B5A92C4B56076B] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWgICYJg RemoteTransferManager PoolAcceptFile 00001E25A89A55BF477BB2AC1ABD2C1D5AF5] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00001E25A89A55BF477BB2AC1ABD2C1D5AF5
14 Feb 2024 17:29:52 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWgIFYiA RemoteTransferManager PoolAcceptFile 00007BF763650E9E4C8590B5A92C4B56076B] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00007BF763650E9E4C8590B5A92C4B56076B
14 Feb 2024 17:32:18 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWgrZupA RemoteTransferManager PoolAcceptFile 00002E8C74A458C54143BD9899950E776D4B] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:32:18 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWgrZupA RemoteTransferManager PoolAcceptFile 00002E8C74A458C54143BD9899950E776D4B] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:32:18 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWgrZupA RemoteTransferManager PoolAcceptFile 00002E8C74A458C54143BD9899950E776D4B] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00002E8C74A458C54143BD9899950E776D4B
14 Feb 2024 17:32:58 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRWg0krGA RemoteTransferManager PoolAcceptFile 000088A0502B4EAF4459A48DF74A580B3DE5] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:32:58 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRWg0krGA RemoteTransferManager PoolAcceptFile 000088A0502B4EAF4459A48DF74A580B3DE5] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:32:58 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRWg0krGA RemoteTransferManager PoolAcceptFile 000088A0502B4EAF4459A48DF74A580B3DE5] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000088A0502B4EAF4459A48DF74A580B3DE5
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQD0A RemoteTransferManager PoolAcceptFile 0000210D1D5229FF45B2A03E289AE294F37F] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQuyA RemoteTransferManager PoolAcceptFile 00001E183FD06C834DB4A9A53A31E64993F3] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWhYQ4jA RemoteTransferManager PoolAcceptFile 000019AD237AEDAF465D8C763711BAD4B5E5] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhW9KJg RemoteTransferManager PoolAcceptFile 0000E7E5778BEAB5454EAC412A9FB135E0F3] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQuyA RemoteTransferManager PoolAcceptFile 00001E183FD06C834DB4A9A53A31E64993F3] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhW9KJg RemoteTransferManager PoolAcceptFile 0000E7E5778BEAB5454EAC412A9FB135E0F3] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQuyA RemoteTransferManager PoolAcceptFile 00001E183FD06C834DB4A9A53A31E64993F3] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00001E183FD06C834DB4A9A53A31E64993F3
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhW9KJg RemoteTransferManager PoolAcceptFile 0000E7E5778BEAB5454EAC412A9FB135E0F3] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000E7E5778BEAB5454EAC412A9FB135E0F3
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQD0A RemoteTransferManager PoolAcceptFile 0000210D1D5229FF45B2A03E289AE294F37F] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWhYQ4jA RemoteTransferManager PoolAcceptFile 000019AD237AEDAF465D8C763711BAD4B5E5] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRWhYQD0A RemoteTransferManager PoolAcceptFile 0000210D1D5229FF45B2A03E289AE294F37F] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000210D1D5229FF45B2A03E289AE294F37F
14 Feb 2024 17:35:22 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWhYQ4jA RemoteTransferManager PoolAcceptFile 000019AD237AEDAF465D8C763711BAD4B5E5] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000019AD237AEDAF465D8C763711BAD4B5E5
14 Feb 2024 17:36:02 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWhgZb+g RemoteTransferManager PoolAcceptFile 000021452F2CBD3943A28CBD2E4B4FE8FF2F] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:36:02 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWhgZb+g RemoteTransferManager PoolAcceptFile 000021452F2CBD3943A28CBD2E4B4FE8FF2F] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:36:02 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWhgZb+g RemoteTransferManager PoolAcceptFile 000021452F2CBD3943A28CBD2E4B4FE8FF2F] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000021452F2CBD3943A28CBD2E4B4FE8FF2F
14 Feb 2024 17:40:28 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWifr6cg RemoteTransferManager PoolAcceptFile 0000744D3AF6E1F54BB98E2C3A60AAA6DD7F] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:40:28 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWifr6cg RemoteTransferManager PoolAcceptFile 0000744D3AF6E1F54BB98E2C3A60AAA6DD7F] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:40:28 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRWifr6cg RemoteTransferManager PoolAcceptFile 0000744D3AF6E1F54BB98E2C3A60AAA6DD7F] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000744D3AF6E1F54BB98E2C3A60AAA6DD7F
14 Feb 2024 17:41:58 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWi1OtnA RemoteTransferManager PoolAcceptFile 0000C2832A4619134C37A22A001AA60BA486] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:41:58 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWi1OtnA RemoteTransferManager PoolAcceptFile 0000C2832A4619134C37A22A001AA60BA486] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:41:58 (badger2_lhcbtape) [door:webdav2884-seacow14@webdav2884-seacow14Domain:AAYRWi1OtnA RemoteTransferManager PoolAcceptFile 0000C2832A4619134C37A22A001AA60BA486] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000C2832A4619134C37A22A001AA60BA486
14 Feb 2024 17:42:11 (badger2_lhcbtape) [door:webdav2884-walrus14@webdav2884-walrus14Domain:AAYRWi5QwQg RemoteTransferManager PoolAcceptFile 000074DBC352D69F41FC8C5919AEA31D9342] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:42:11 (badger2_lhcbtape) [door:webdav2884-walrus14@webdav2884-walrus14Domain:AAYRWi5QwQg RemoteTransferManager PoolAcceptFile 000074DBC352D69F41FC8C5919AEA31D9342] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:42:11 (badger2_lhcbtape) [door:webdav2884-walrus14@webdav2884-walrus14Domain:AAYRWi5QwQg RemoteTransferManager PoolAcceptFile 000074DBC352D69F41FC8C5919AEA31D9342] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000074DBC352D69F41FC8C5919AEA31D9342
14 Feb 2024 17:42:25 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWi9IBdA RemoteTransferManager PoolAcceptFile 000051AA95A5ADC546219D22EAFAF9175591] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:42:25 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWi9IBdA RemoteTransferManager PoolAcceptFile 000051AA95A5ADC546219D22EAFAF9175591] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:42:25 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRWi9IBdA RemoteTransferManager PoolAcceptFile 000051AA95A5ADC546219D22EAFAF9175591] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000051AA95A5ADC546219D22EAFAF9175591
14 Feb 2024 17:53:00 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWlS+ibA RemoteTransferManager PoolAcceptFile 000035310FD59ABC4CC696968F572C840186] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:53:00 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWlS+ibA RemoteTransferManager PoolAcceptFile 000035310FD59ABC4CC696968F572C840186] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:53:00 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRWlS+ibA RemoteTransferManager PoolAcceptFile 000035310FD59ABC4CC696968F572C840186] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000035310FD59ABC4CC696968F572C840186
14 Feb 2024 17:55:23 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRWl1KGVA RemoteTransferManager PoolAcceptFile 0000BCFDCDB84F1E46E898E101D61232F82F] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:55:23 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRWl1KGVA RemoteTransferManager PoolAcceptFile 0000BCFDCDB84F1E46E898E101D61232F82F] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:55:23 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRWl1KGVA RemoteTransferManager PoolAcceptFile 0000BCFDCDB84F1E46E898E101D61232F82F] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000BCFDCDB84F1E46E898E101D61232F82F
14 Feb 2024 17:59:43 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWmzNdUg RemoteTransferManager PoolAcceptFile 000040EB4CFE48784E6897F5FB4A50C309CB] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 17:59:43 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWmzNdUg RemoteTransferManager PoolAcceptFile 000040EB4CFE48784E6897F5FB4A50C309CB] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 17:59:43 (badger2_lhcbtape) [door:webdav2884-penguin12@webdav2884-penguin12Domain:AAYRWmzNdUg RemoteTransferManager PoolAcceptFile 000040EB4CFE48784E6897F5FB4A50C309CB] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000040EB4CFE48784E6897F5FB4A50C309CB
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW7lTISg RemoteTransferManager PoolAcceptFile 00001E1ED9D5337B42F4AF4E50E90D3D115A] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRW7mchjg RemoteTransferManager PoolAcceptFile 00000F217BC879B048378A17C92727171BC7] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRW7mchjg RemoteTransferManager PoolAcceptFile 00000F217BC879B048378A17C92727171BC7] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin14@webdav2884-penguin14Domain:AAYRW7mchjg RemoteTransferManager PoolAcceptFile 00000F217BC879B048378A17C92727171BC7] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00000F217BC879B048378A17C92727171BC7
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW7lTISg RemoteTransferManager PoolAcceptFile 00001E1ED9D5337B42F4AF4E50E90D3D115A] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:32:42 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW7lTISg RemoteTransferManager PoolAcceptFile 00001E1ED9D5337B42F4AF4E50E90D3D115A] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00001E1ED9D5337B42F4AF4E50E90D3D115A
14 Feb 2024 19:34:15 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW7755EA RemoteTransferManager PoolAcceptFile 000023698CA1B43E4A62941DEFEE98B6042B] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:34:15 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW7755EA RemoteTransferManager PoolAcceptFile 000023698CA1B43E4A62941DEFEE98B6042B] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:34:15 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW7755EA RemoteTransferManager PoolAcceptFile 000023698CA1B43E4A62941DEFEE98B6042B] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/000023698CA1B43E4A62941DEFEE98B6042B
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRW7+wqVA RemoteTransferManager PoolAcceptFile 00005162EC42A6874E43BC91D6CA9200F13C] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW8BCtAg RemoteTransferManager PoolAcceptFile 0000A246CDBC4B0348E8899AF248CE14737B] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW8BCtAg RemoteTransferManager PoolAcceptFile 0000A246CDBC4B0348E8899AF248CE14737B] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-penguin13@webdav2884-penguin13Domain:AAYRW8BCtAg RemoteTransferManager PoolAcceptFile 0000A246CDBC4B0348E8899AF248CE14737B] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000A246CDBC4B0348E8899AF248CE14737B
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRW7+wqVA RemoteTransferManager PoolAcceptFile 00005162EC42A6874E43BC91D6CA9200F13C] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:34:28 (badger2_lhcbtape) [door:webdav2884-seacow13@webdav2884-seacow13Domain:AAYRW7+wqVA RemoteTransferManager PoolAcceptFile 00005162EC42A6874E43BC91D6CA9200F13C] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00005162EC42A6874E43BC91D6CA9200F13C
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AaiyA RemoteTransferManager PoolAcceptFile 0000C0969577143043EFAF7C68A994F2861C] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AfYWg RemoteTransferManager PoolAcceptFile 0000194AC104D8F74A55B1AD1F9201980D50] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AZYkA RemoteTransferManager PoolAcceptFile 00002ED1B81380B2402CA2CA610FE1DF3387] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW+AW3bg RemoteTransferManager PoolAcceptFile 0000C106E941B8DD4570950C28C161DB08E7] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AXJ/A RemoteTransferManager PoolAcceptFile 0000FDC8B5A00545419DB4E5C6C88D535787] Transfer failed: No route to host (Host unreachable)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AfYWg RemoteTransferManager PoolAcceptFile 0000194AC104D8F74A55B1AD1F9201980D50] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AaiyA RemoteTransferManager PoolAcceptFile 0000C0969577143043EFAF7C68A994F2861C] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AZYkA RemoteTransferManager PoolAcceptFile 00002ED1B81380B2402CA2CA610FE1DF3387] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW+AW3bg RemoteTransferManager PoolAcceptFile 0000C106E941B8DD4570950C28C161DB08E7] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AfYWg RemoteTransferManager PoolAcceptFile 0000194AC104D8F74A55B1AD1F9201980D50] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000194AC104D8F74A55B1AD1F9201980D50
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2884-seacow12@webdav2884-seacow12Domain:AAYRW+AW3bg RemoteTransferManager PoolAcceptFile 0000C106E941B8DD4570950C28C161DB08E7] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000C106E941B8DD4570950C28C161DB08E7
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav2880-webdav-cert@webdav2880-webdav-certDomain:AAYRW+AaiyA RemoteTransferManager PoolAcceptFile 0000C0969577143043EFAF7C68A994F2861C] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000C0969577143043EFAF7C68A994F2861C
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AZYkA RemoteTransferManager PoolAcceptFile 00002ED1B81380B2402CA2CA610FE1DF3387] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/00002ED1B81380B2402CA2CA610FE1DF3387
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AXJ/A RemoteTransferManager PoolAcceptFile 0000FDC8B5A00545419DB4E5C6C88D535787] Transfer failed in post-processing: File size mismatch (expected=5368709120, actual=0)
14 Feb 2024 19:43:32 (badger2_lhcbtape) [door:webdav443-webdav-cert@webdav443-webdav-certDomain:AAYRW+AXJ/A RemoteTransferManager PoolAcceptFile 0000FDC8B5A00545419DB4E5C6C88D535787] Failed to read file size: java.nio.file.NoSuchFileException: /space/lhcb/tape/pool/data/0000FDC8B5A00545419DB4E5C6C88D535787
14 Feb 2024 22:21:42 (badger2_lhcbtape) [] Flush of 0000F727C00F2D234DF0B5B70069DEE9F0F2 failed with: java.lang.IllegalArgumentException: Negative number of active requests.

Looking at the last NoSuchFileException:

[root@dcmain /var/lib/dcache/billing/2024/02]# grep '0000FDC8B5A00545419DB4E5C6C88D535787' *
billing-2024.02.14:02.14 19:43:32 [pool:badger2_lhcbtape@badger2_lhcbtapeDomain:remove] [0000FDC8B5A00545419DB4E5C6C88D535787,0] [Unknown] lhcb:tape@osm {0:"Transfer failed and replica is empty"}
billing-2024.02.14:02.14 19:43:32 [pool:badger2_lhcbtape:transfer] [0000FDC8B5A00545419DB4E5C6C88D535787,0] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw] lhcb:tape@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://guppy1.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw} [door:RemoteTransferManager@transfermanagersDomain:1707936159590-896186] {10027:"No route to host (Host unreachable)"}
billing-error-2024.02.14:02.14 19:43:32 [pool:badger2_lhcbtape:transfer] [0000FDC8B5A00545419DB4E5C6C88D535787,0] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw] lhcb:tape@osm 0 0 true {RemoteHttpsDataTransfer-1.1:https://guppy1.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw} [door:RemoteTransferManager@transfermanagersDomain:1707936159590-896186] {10027:"No route to host (Host unreachable)"}

Oops, guppy1, that node has been decommissioned... It was still in the old DNS round robin webdav.grid.sara.nl. The new DNS RR is webdav.grid.surfsara.nl and that one is OK.

[root@dcmain /var/lib/dcache/billing/2024/02]# host webdav.grid.sara.nl | grep -2 145.100.33.67
webdav.grid.sara.nl has address 145.100.33.224
webdav.grid.sara.nl has address 145.100.33.211
webdav.grid.sara.nl has address 145.100.33.67
webdav.grid.sara.nl has address 145.100.33.210
webdav.grid.sara.nl has address 145.100.33.226

I now made webdav.grid.sara.nl a CNAME to webdav.grid.surfsara.nl.

So, I'm guessing that LHCb was using FTS third party transfer to move this file:

/pnfs/grid.sara.nl/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw

To this location:

/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030213/datamover08-m/4c203272-8006-4e94-9d1c-83afe0eb4a4c.raw

But because they used a DNS RR with old information, the file was collected from a non-existing door node. Not a surprise that failed.

OK, so I found a problem in our setup, but no idea if it had anything to do with this issue.

I also had a look at the "Negative number of active requests" error. I looked at the timestamp, 22:21:42, and looked up in our HSM script's log to see what happened. I could see that writing to tape of file 0000F727C00F2D234DF0B5B70069DEE9F0F2 succeeded, and that it was about to exit with exit code 0.

02/14/24-22:21:42.109173652  7410 dmfcp.sh[399781][v20.3,main,1196,INFO]: put: Completed 'put' for 0000F727C00F2D234DF0B5B70069DEE9F0F2.
02/14/24-22:21:42.116836359  7410 dmfcp.sh[399781][v20.3,main,1197,INFO]: put: Reporting back to dCache: osm://osm/?store=lhcb&group=tape&bfid=/space/lhcb/tape/pool/data/0000F727C00F2D234DF0B5B70069DEE9F0F2
02/14/24-22:21:42.126051709  7410 dmfcp.sh[399781][v20.3,main,1252,INFO]: : Exit code 0

In the billing file I can indeed see a store with error code 0 (succeeded), but also a 666:"Negative number of active requests":

[root@dcmain /var/lib/dcache/billing/2024/02]# grep '0000F727C00F2D234DF0B5B70069DEE9F0F2' billing-2024.02.*
billing-2024.02.14:02.14 05:50:51 [pool:badger2_lhcbtape:transfer] [0000F727C00F2D234DF0B5B70069DEE9F0F2,5368709120] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030202/datamover01-m/93e8f0b6-f1c0-4c2f-83c1-d20c9c61661e.raw] lhcb:tape@osm 5368709120 1781019 true {RemoteHttpsDataTransfer-1.1:https://penguin13.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/lhcb/LHCb-Disk/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030202/datamover01-m/93e8f0b6-f1c0-4c2f-83c1-d20c9c61661e.raw} [door:RemoteTransferManager@transfermanagersDomain:1707884377824-810191] {0:""}
billing-2024.02.14:02.14 05:50:51 [door:SrmManager@srmmanagerDomain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lbprods/CN=693025/CN=Robot: LHCb offline productions":36521:31637:unknown] [0000F727C00F2D234DF0B5B70069DEE9F0F2,5368709120] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030202/datamover01-m/93e8f0b6-f1c0-4c2f-83c1-d20c9c61661e.raw] lhcb:tape@osm 0 0 {0:""}
billing-2024.02.14:02.14 17:09:29 [pool:badger2_lhcbtape@badger2_lhcbtapeDomain:store] [0000F727C00F2D234DF0B5B70069DEE9F0F2,5368709120] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030202/datamover01-m/93e8f0b6-f1c0-4c2f-83c1-d20c9c61661e.raw] lhcb:tape@osm 67422 18665416 {0:""}
billing-2024.02.14:02.14 17:09:29 [pool:badger2_lhcbtape@badger2_lhcbtapeDomain:store] [0000F727C00F2D234DF0B5B70069DEE9F0F2,5368709120] [/pnfs/grid.sara.nl/data/lhcb/LHCb-Tape/lhcb/dataChallenge/lhcb/EXPFEB2022/TURBO/2022030202/datamover01-m/93e8f0b6-f1c0-4c2f-83c1-d20c9c61661e.raw] lhcb:tape@osm 67424 18665416 {666:"Negative number of active requests"}

But the timestamps are completely different: 17:09:29 (5 minutes after pool start) instead of 22:21:42.

Not sure if this of any help.

ageorget commented 7 months ago

Hi,

This happened more and more often at IN2P3 for Atlas too in the last weeks. (dCache 9.2.14) The Store job queue stays stuck displaying 1 active store request but there is no store request in progress :

  Job Queues (active/queued)
    to store   1/0
    from store 0/0
    delete     0/0

(pool-atlas-tapebuffer-li418a@ccdcatli418-pool-atlas-tapebuffer-li418a-Domain) ageorget > st ls
(pool-atlas-tapebuffer-li418a@ccdcatli418-pool-atlas-tapebuffer-li418a-Domain) ageorget >

I also see some Negative number of active requests or Negative number of canceled requests like Onno reported.

(pool-atlas-tapebuffer-li539a) [] Flush of 0000EDF0656D3F404B21A6DAEE0F14737DB7 failed with: java.lang.IllegalArgumentException: Negative number of canceled requests.
(pool-atlas-tapebuffer-li418a) [] Flush of 0000940F68BA82694678B5FA835769D3CA7F failed with: java.lang.IllegalArgumentException: Negative number of canceled requests.

24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: 24 Apr 2024 12:15:02 (pool-atlas-tapebuffer-li417a) [] Unexpected failure during state change notification
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: java.lang.IllegalArgumentException: Negative number of active requests
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.nearline.NearlineStorageHandler$QueueStat.<init>(NearlineStorageHandler.java:151)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.incCancelFromActive(NearlineStorageHandler.java:688)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.cancel(NearlineStorageHandler.java:594)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequestContainer.cancel(NearlineStorageHandler.java:808)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.nearline.NearlineStorageHandler.stateChanged(NearlineStorageHandler.java:487)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at org.dcache.pool.repository.v5.StateChangeListeners.lambda$stateChanged$0(StateChangeListeners.java:60)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Apr 24 12:15:02 ccdcatli417 dcache@ccdcatli417-pool-atlas-tapebuffer-li417a-Domain[4797]: at java.base/java.lang.Thread.run(Thread.java:829)

There is no error on the flush side, no flush process running on the server. I only see usual errors with Atlas TPC transfers.

I attached the log files of the 2 pools currently stuck (cell log, pinboard, info cmd) : pool-atlas-tapebuffer-li539a.log pool-atlas-tapebuffer-li418a.log

Adrien

kofemann commented 7 months ago

@ageorget does this happen with stores? Can you check, if those files still exist?

ageorget commented 7 months ago

Yes, this only happens with stores on tape flush pools.

For the negative number of active requests error, sometimes it seems that the file was successfully copied on tape, the store request succeed but something went wrong when the pool tried to update the number of active requests


Apr 26 04:20:20 ccdcatli419 sudo[177818]:     root : TTY=unknown ; PWD=/usr/share/dcache ; USER=atlagrid ; COMMAND=/opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li419a/pool/data/0000EDB9FE41F4B145A6932EA08DA9BAA432 cchpssdcache:/hpss5/dcache/atlas/mctape/2024/04/0000EDB9FE41F4B145A6932EA08DA9BAA432
Apr 26 04:20:21 ccdcatli419 dcache@ccdcatli419-pool-atlas-tapebuffer-li419a-Domain[95245]: 26 Apr 2024 04:20:21 (pool-atlas-tapebuffer-li419a) [] Flush of 0000EDB9FE41F4B145A6932EA08DA9BAA432 failed with: java.lang.IllegalArgumentException: Negative number of active requests.

26/04/24 04:20:20 INFO     177814 ['put', '0000EDB9FE41F4B145A6932EA08DA9BAA432', '/data/pool-atlas-tapebuffer-li419a/pool/data/0000EDB9FE41F4B145A6932EA08DA9BAA432', '-si=size=286665458;new=true;stored=false;sClass=mctape:atlas;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/in2p3.fr/data/atlas/atlasmctape/mc23_13p6TeV/EVNT/e8514_e8528/mc23_13p6TeV.700841.Sh_2214_Zmumu_mZ_105_ECMS.merge.EVNT.e8514_e8528_tid37663151_00/EVNT.37663151._004077.pool.root.1;uid=3327;writeToken=61;StorageGroup=cchpssdcache:0;gid=124;StoreName=mctape;SpaceTokenDescription=ATLASMCTAPE;xattr.xdg.origin.url=https://se1.farm.particle.cz:443/atlas/atlasdatadisk/rucio/mc23_13p6TeV/b6/18/EVNT.37663151._004077.pool.root.1?copy_mode=pull;links=00003A5D0393940D4B66AB22E0A35A4D9D6D EVNT.37663151._004077.pool.root.1;SpaceToken=61;flag-c=1:929388c2;LinkGroupId=10;store=mctape;group=atlas;bfid=<Unknown>;', '-pnfs=/pnfs/in2p3.fr', '-hsmfs=/hpss5/dcache', '-remove=/opt/software/rfio-hpss/prod/bin/rfrm', '-method=treqs', '-c:remove=1', '-copy=/opt/software/rfio-hpss/prod/bin/rfcp']
26/04/24 04:20:20 INFO     177814 sudo -u atlagrid /opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li419a/pool/data/0000EDB9FE41F4B145A6932EA08DA9BAA432 cchpssdcache:/hpss5/dcache/atlas/mctape/2024/04/0000EDB9FE41F4B145A6932EA08DA9BAA432
26/04/24 04:20:21 INFO     177814 286665458 bytes in 1 seconds through local (in) and p1p1 (out) (279946 KB/sec)
onnozweers commented 2 months ago

Any news on this? We again have 3 Atlas write tape pools that filled up after "Negative number of active requests".

kofemann commented 2 months ago

Is it stores or restores?

onnozweers commented 2 months ago

Stores. We only see this error on pools that are dedicated for writing to tape. We never see this for pools that read from tape. We have some pools that do both stores and restores, and I've never seen this error there, but that might have to do with the small data volumes on these pools.

ageorget commented 1 month ago

Still happens with Atlas and LHCb dedicated store pools at IN2P3. Only a pool restart can help for now.

kofemann commented 1 month ago

Unfortunately, I am unable to reproduce it. Thus, please enable jfr in the pool with a config available at https://desy.de/~tigran/exceptions-only.jfc. Go into the System cell of the pool domain and run jfr start /tmp/exceptions-only.jfc (by assuming that the flight recorder config is stored in /tmp/exceptions-only.jfc).

When the error occurs, run jfr stop and send me the generated files (the path will be displayed on the screen)

kofemann commented 1 month ago

@ageorget @onnozweers I have a theory. Can you install a custom RPM at one of your pools? Which version of dCache do you run?

kofemann commented 1 month ago

A RPM with a potential fix is here: https://www.desy.de/~tigran/dcache-9.2.27.1397bbb-1.noarch.rpm

You should get a warning in the log:

Request xxx was already removed from the queue.
ageorget commented 1 month ago

Hi @kofemann We are running dCache 9.2.14 on pools. I'm going to install your RPM on some pools. Thanks for having a look on this issue!

ageorget commented 1 month ago

Hi @kofemann

I'm afraid this issue is not yet fixed. One pool running your 9.2.27 patch is currently stuck flushing files to tape after raising the exception Negative number of active requests

Oct 13 02:31:42 ccdcacli542 sudo[293556]:     root : PWD=/usr/share/dcache ; USER=atlagrid ; COMMAND=/opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li542a/pool/data/0000AA93F9CCB7A041A39265BEEB170B09DE cchpssdcache:/hpss5/dcache/atlas/datatape/2024/10/0000AA93F9CCB7A041A39265BEEB170B09DE
Oct 13 02:31:43 ccdcacli542 sudo[293568]:     root : PWD=/usr/share/dcache ; USER=atlagrid ; COMMAND=/opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li542a/pool/data/00000AF04BC08CC24237AC57BC7A3A38209E cchpssdcache:/hpss5/dcache/atlas/datatape/2024/10/00000AF04BC08CC24237AC57BC7A3A38209E
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:31:44 (pool-atlas-tapebuffer-li542a) [] Flush of 000026C6070D0DFD4D4EABF807D9120C9FED failed with: java.lang.IllegalArgumentException: Negative number of queued requests.
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:31:44 (pool-atlas-tapebuffer-li542a) [] Uncaught exception in thread pool-36-thread-183
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: java.lang.IllegalArgumentException: Negative number of active requests
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$QueueStat.<init>(NearlineStorageHandler.java:151)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.decActive(NearlineStorageHandler.java:721)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.removeFromQueue(NearlineStorageHandler.java:653)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequestContainer.remove(NearlineStorageHandler.java:896)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequestContainer.removeAndCallback(NearlineStorageHandler.java:868)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$FlushRequestImpl.done(NearlineStorageHandler.java:1208)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.NearlineStorageHandler$FlushRequestImpl.failed(NearlineStorageHandler.java:1059)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.pool.nearline.AbstractBlockingNearlineStorage$Task.run(AbstractBlockingNearlineStorage.java:382)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.util.CDCExecutorServiceDecorator$WrappedRunnable.run(CDCExecutorServiceDecorator.java:130)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:247)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Oct 13 02:31:44 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]:         at java.base/java.lang.Thread.run(Thread.java:829)
Oct 13 02:38:50 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:38:50 (pool-atlas-tapebuffer-li542a) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAYkUHptkyg RemoteTransferManager PoolAcceptFile 000057189123D67D4DE9B41652C25A3B675E] Transfer failed in post-processing: No such file or directory: 000057189123D67D4DE9B41652C25A3B675E
Oct 13 02:38:50 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:38:50 (pool-atlas-tapebuffer-li542a) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAYkUHptkyg RemoteTransferManager PoolAcceptFile 000057189123D67D4DE9B41652C25A3B675E] Failed to read file size: java.nio.file.NoSuchFileException: /data/pool-atlas-tapebuffer-li542a/pool/data/000057189123D67D4DE9B41652C25A3B675E
Oct 13 02:42:19 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:42:19 (pool-atlas-tapebuffer-li542a) [] Trying to kill <webdav> id=67118110
Oct 13 02:42:19 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:42:19 (pool-atlas-tapebuffer-li542a) [] Trying to kill <webdav> id=67118112
Oct 13 02:42:19 ccdcacli542 dcache@ccdcacli542-pool-atlas-tapebuffer-li542a-Domain[85118]: 13 Oct 2024 02:42:19 (pool-atlas-tapebuffer-li542a) [] Trying to kill <webdav> id=67118113
/usr/bin/dcache version                                                                                            
9.2.27

I ran the JFR but after the exception was raised. So I don't know if it's really usefull (pool is not yet restarted to unstuck flushes). ccdcacli542-pool-atlas-tapebuffer-li542a-Domain_14503209413572630220.jfr.gz

kofemann commented 1 month ago

That's a shame!