dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
https://dcache.org

HSM flushing stuck #6426

Closed: ageorget closed this issue 1 year ago

ageorget commented 2 years ago

Hi,

Since we upgraded to dCache 6.2.35 (from 5.2), I have sometimes observed tape flushing pools getting stuck and no longer flushing files to the HSM, for no apparent reason.

There are many files on the pool waiting to be flushed; the next flush is scheduled but nothing happens:


--- flush (Controller for centralising flushing) ---
   Flush interval                : 60000 ms
   Maximum classes flushing      : 1000
   Minimum flush delay on error  : 60000 ms
   Next flush                    : Mon Jan 24 12:12:47 CET 2022

> flush ls
Class                 Active   Error  Last/min  Requests    Failed
datatape:atlas@osm         1       4      1046      1352         0

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   2/0
    from store 0/0
    delete     0/0


The flush ls command shows 1 active flush, but there is no active process alive anymore, and the Nearline storage manager also shows 2 active store requests. In the pool pinboard, the last messages I see are:
24 Jan 2022 11:09:47 [flush] [] Flushing SCI=mctape:atlas@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
24 Jan 2022 11:09:47 [pool-10-thread-12245] [] remove entry 0000E34B76CD96F6442EAC7C5A6BD1135AA6: file deleted before being flushed
24 Jan 2022 11:09:47 [pool-32-thread-6219] [] Flush of 0000E34B76CD96F6442EAC7C5A6BD1135AA6 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.).

In the pool log file, I see multiple attempts to kill 2 WebDAV TPC transfers (67281114 and 67281045). The flush stopped working at 10:48:

24 Jan 2022 10:38:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:40:20 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWULYCgzg RemoteTransferManager PoolAcceptFile 0000E4333B013CD84BEDA7069FA061F78558] Transfer failed: rejected GET: 500 Server Error
24 Jan 2022 10:40:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:40:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:41:00 (pool-atlas-tapebuffer-li308a) [] Flush of 0000E4333B013CD84BEDA7069FA061F78558 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.).
24 Jan 2022 10:42:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:42:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:44:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:44:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:46:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:46:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:48:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:48:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:50:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:50:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:51:56 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWUOLi9jA RemoteTransferManager PoolAcceptFile 000071E57E4B25E845789551322EEBD3A455] Marking pool entry 000071E57E4B25E845789551322EEBD3A455 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:51:56 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWUOLi9jA RemoteTransferManager PoolAcceptFile 000071E57E4B25E845789551322EEBD3A455] Transfer failed in post-processing: Checksum mismatch (expected=[1:acc4e187, 2:ea73c4856de5da08521c4c81217f4131], actual=[1:68e9b879, 2:ea73c4856de5da08521c4c81217f4131])
24 Jan 2022 10:52:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:52:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:54:39 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli223@webdav-ccdcatli223Domain:AAXWUPBbV1g RemoteTransferManager PoolAcceptFile 0000FD754E77757E4CAE817E3FB17FB22A92] Marking pool entry 0000FD754E77757E4CAE817E3FB17FB22A92 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:54:39 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli223@webdav-ccdcatli223Domain:AAXWUPBbV1g RemoteTransferManager PoolAcceptFile 0000FD754E77757E4CAE817E3FB17FB22A92] Transfer failed in post-processing: Checksum mismatch (expected=[1:31804d85, 2:a2338f4926e4234ff708bfccdfdb7012], actual=[1:abe6f82a, 2:a2338f4926e4234ff708bfccdfdb7012])
24 Jan 2022 10:54:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:54:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:56:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:56:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:56:59 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPiWgPA RemoteTransferManager PoolAcceptFile 000092ECA93120E64AE18D66F30C27DD2036] Marking pool entry 000092ECA93120E64AE18D66F30C27DD2036 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:56:59 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPiWgPA RemoteTransferManager PoolAcceptFile 000092ECA93120E64AE18D66F30C27DD2036] Transfer failed in post-processing: Checksum mismatch (expected=[1:e15ab265, 2:0df6e5af8600b96da4e163fd322b32b4], actual=[1:96d43e51, 2:0df6e5af8600b96da4e163fd322b32b4])
24 Jan 2022 10:57:52 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPg92aA RemoteTransferManager PoolAcceptFile 0000F53D45E25C8E455EA42537E267662CEA] Marking pool entry 0000F53D45E25C8E455EA42537E267662CEA on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:57:52 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPg92aA RemoteTransferManager PoolAcceptFile 0000F53D45E25C8E455EA42537E267662CEA] Transfer failed in post-processing: Checksum mismatch (expected=[1:acc4e187, 2:ea73c4856de5da08521c4c81217f4131], actual=[1:68e9b879, 2:ea73c4856de5da08521c4c81217f4131])
24 Jan 2022 10:58:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045

> mover ls
67281114 : CANCELED : 00001FA2D5EBE8A8458584D0CAF266E860BC IoMode=[READ, WRITE, CREATE] h={org.dcache.pool.movers.RemoteHttpsDataTransferProtocol@6a87cb64} bytes=1007318600 time/sec=7200 LM=31726 si={mctape:atlas}
67281045 : CANCELED : 00003CA3E034836040D9AB387F9FB6204997 IoMode=[READ, WRITE, CREATE] h={org.dcache.pool.movers.RemoteHttpsDataTransferProtocol@6b277f19} bytes=1109713400 time/sec=7200 LM=39032 si={mctape:atlas}

Can it be related to these stuck canceled TPC transfers?

The only way to get flushing unstuck is to restart the pool. Best, Adrien

paulmillar commented 2 years ago

The repeated attempts to kill those two transfers might be unrelated, but they are still a bug.

Just as extra information: once the JTM has decided to kill a mover, the mover goes into the CANCELED (sic) state. Any subsequent attempt by the JTM to kill that transfer has no effect, despite being logged.
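
For illustration, here is a minimal, hypothetical Java sketch (not dCache's actual JTM or mover code) of an idempotent cancel that still writes a log line on every attempt; that behaviour alone would produce the repeated "Trying to kill" messages with no further effect:

enum MoverState { RUNNING, CANCELED }

class MoverCancelSketch {
    private final long id;
    private volatile MoverState state = MoverState.RUNNING;

    MoverCancelSketch(long id) { this.id = id; }

    // Hypothetical cancel path: the log line is emitted on every call,
    // so each JTM sweep logs "Trying to kill", but only the first call
    // actually changes the mover's state.
    void cancel() {
        System.out.println("Trying to kill <webdav> id=" + id);
        if (state == MoverState.CANCELED) {
            return; // later attempts are no-ops, as described above
        }
        state = MoverState.CANCELED;
        // a real mover would also interrupt its transfer thread here
    }
}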

So, the question is: why are these transfers not being cancelled correctly?

There is (perhaps) an ancillary question: why should transfers in the CANCELED state affect storing data onto tape -- assuming that is what is causing the store operations to fail.

Adrien, do you see additional log entries from subsequent scheduled flush/store operations?

ageorget commented 2 years ago

Hi Paul,

I couldn't find any additional logs. What is strange is that the Nearline storage manager indicates 2 active store requests in the pool's info output, whereas there are no store requests in progress anymore (and the hsm has c:puts=1). Restarting the pool deleted these canceled transfers, removed these 2 files, and flushing resumed:

24 Jan 2022 12:30:25 (pool-atlas-tapebuffer-li308a) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
24 Jan 2022 12:30:25 (pool-atlas-tapebuffer-li308a) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data/pool-atlas-tapebuffer-li308a/pool/data;meta=/data/pool-atlas-tapebuffer-li308a/pool/meta]]].
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering 00003CA3E034836040D9AB387F9FB6204997...
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Fetched storage info for 00003CA3E034836040D9AB387F9FB6204997 from name space.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Removed 00003CA3E034836040D9AB387F9FB6204997 because name space entry was deleted.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering 00001FA2D5EBE8A8458584D0CAF266E860BC...
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Fetched storage info for 00001FA2D5EBE8A8458584D0CAF266E860BC from name space.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Removed 00001FA2D5EBE8A8458584D0CAF266E860BC because name space entry was deleted.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Pool mode changed to enabled

I really don't know if there is a link between these canceled transfers and the flush operations being stuck. Currently I see some CANCELED transfers on the pool, but flushing is running normally...

kofemann commented 2 years ago

Hi @ageorget ,

if you check those CANCELED entries, are they all for removed files?

ageorget commented 2 years ago

Hi @kofemann,

If I check these 2 canceled entries, the request failed with

error 24 java.io.IOException: timed out while waiting for mover reply

reported by RemoteTransferManager@srmDomain.

Webdav door log :

January 24th 2022, 01:39:37.000 door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWSTQJ1tA RemoteTransferManager PoolAcceptFile 00001FA2D5EBE8A8458584D0CAF266E860BC,message=Creating new entry for 00001FA2D5EBE8A8458584D0CAF266E860BC

January 23rd 2022, 23:37:52.000 door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWR4CYFZA RemoteTransferManager PoolAcceptFile 00003CA3E034836040D9AB387F9FB6204997,message=Creating new entry for 00003CA3E034836040D9AB387F9FB6204997

doorinfo :

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
client         | lcg-se1.sfu.computecanada.ca
mappedgid      | 124
mappeduid      | 3327
owner          | atlagrid
path           | /pnfs/in2p3.fr/data/atlas/atlasmctape/mc16_13TeV/AOD/e8338_e7400_s3126_r10201_r10210/mc16_13TeV.700409.ShEG_2211_Wenu_01NLO_unbiased_SMT.merge.AOD.e8338_e7400_s3126_r10201_r10210_tid27555308_00/AOD.27555308._000023.pool.root.1_1642984770
queuedtime     | 46
connectiontime | 7200046
action         | request
cellname       | RemoteTransferManager@srmDomain
datestamp      | 2022-01-24 01:39:37.584+01
errorcode      | 24
errormessage   | java.io.IOException: timed out while waiting for mover reply
pnfsid         | 00001FA2D5EBE8A8458584D0CAF266E860BC
transaction    | door:RemoteTransferManager@srmDomain:1642984777584-9856732
fqan           |

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
client         | lcg-se1.sfu.computecanada.ca
mappedgid      | 124
mappeduid      | 3327
owner          | atlagrid
path           | /pnfs/in2p3.fr/data/atlas/atlasmctape/mc16_13TeV/AOD/e8338_e7400_s3126_r10201_r10210/mc16_13TeV.700409.ShEG_2211_Wenu_01NLO_unbiased_SMT.merge.AOD.e8338_e7400_s3126_r10201_r10210_tid27555308_00/AOD.27555308._000023.pool.root.1_1642977465
queuedtime     | 50
connectiontime | 7200051
action         | request
cellname       | RemoteTransferManager@srmDomain
datestamp      | 2022-01-23 23:37:52.047+01
errorcode      | 24
errormessage   | java.io.IOException: timed out while waiting for mover reply
pnfsid         | 00003CA3E034836040D9AB387F9FB6204997
transaction    | door:RemoteTransferManager@srmDomain:1642977472047-9841701

The webdav.mover.timeout is set to 7200000 ms. Sometimes, during heavy tape import activity and mostly when transfers come from the US, where traffic is often slow, the pools fail to respond to the door within those 7200 seconds. But that should not be a problem for dCache, right?
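
For reference, a sketch of how that setting would look in the configuration, assuming it is set in /etc/dcache/dcache.conf (it could equally live in a layout file):

# 7200000 ms = 2 hours, matching the ~7200-second connectiontime in the billing records above
webdav.mover.timeout = 7200000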

kofemann commented 2 years ago

Ah, ok. I was thinking those were canceled flushes...

ageorget commented 2 years ago

Hi,

Ok, I think the CANCELED transfers are not related to this issue. I had the same problem this morning on another ATLAS tape import pool.

The queue for flush still indicates 1 active request but there's no flush process in progress. No CANCELED transfers reported by the mover ls command this time.

flush ls
Class                 Active   Error  Last/min  Requests    Failed
datatape:atlas@osm         1       0      1664       763         0

mover ls
(no output)

queue ls queue -l
               Class@Hsm : datatape:atlas@osm
 Expiration rest/defined : -107010 / 0   seconds
 Pending    rest/defined : 772 / 0
 Size       rest/defined : 6113343353791 / 0
 Active Store Procs.     : 1
  000009D0521F6FDE4495946B9DF8FF76F14B  osm     datatape:atlas      -                   
  0000A4B3EFD6229A498A8E17D022DBE975CC  osm     datatape:atlas      -                   
  000095FC79E6D3D44A34B63079CCCCB8A504  osm     datatape:atlas      -                   
  000081D950628F7947F59C7AA438DAFA8206  osm     datatape:atlas      -                   
...

The pool pinboard : 
28 Jan 2022 10:25:14 [pool-atlas-tapebuffer-li419a-1222] [6wc:J6E admin] Flushing SCI=datatape:atlas@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=766

The manual flush is working :

> flush pnfsid 00006A099B706D8342DEB150BD2722639E8F
28 Jan 2022 10:29:23 [pool-32-thread-516] [6wc:J6E admin] Flushed 00006A099B706D8342DEB150BD2722639E8F to nearline storage: osm://osm/?store=cchpssdcache:/hpss5/dcache/atlas/datatape/2022/01/00006A099B706D8342DEB150BD2722639E8F&group=atlagrid&bfid=

But flushing the class does not work:

flush class osm datatape:atlas
(3) java.lang.IllegalArgumentException: Is already active. from flush class

The pool log file with a thread dump: https://mycore.core-cloud.net/index.php/s/w3gXOmlasrC8m7W

I also tried to increase c:puts from 1 to 5 to check whether a single flush request was blocking the pool flush, but nothing changed.

Tell me if you want me to check something before I restart the pool. Adrien

ageorget commented 2 years ago

Hi, the same issue happened many times on the other dCache instance (6.2.38) during a large tape import campaign by CTA. The symptoms are the same: the flush is stuck with some active store jobs, and there is no way to find out what is stuck or why:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   2/0
    from store 0/0
    delete     0/0

pool log file : https://mycore.core-cloud.net/index.php/s/A69mFQUV1OwIUHN

kofemann commented 2 years ago

@ageorget do you see those flush processes on the pool?

ageorget commented 2 years ago

No, there are no flush processes left on the pool. I didn't make any changes to our HSM script; the only change was the upgrade from 5.2 to 6.2. When a flush occasionally fails for some reason (like an Input/output error), this is correctly handled by the HSM script and dCache. Another point: CTA only uses SRM/GridFTP for its transfers, so this is not related to a particular protocol.

kofemann commented 2 years ago

In the log I can see the following record:

08 Feb 2022 02:06:37 (pool-cta-tape-li404a) [5de80647-3a9a-4f3e-ba8a-0b2a4d7cec81 000005515735DB2942C1BA5021371B6476C5] HSM script failed (script reported: 1: sh: line 0: kill: (28254) - No such process;)
08 Feb 2022 02:06:37 (pool-cta-tape-li404a) [] Flush of 000005515735DB2942C1BA5021371B6476C5 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (28254) - No such process;)).

Is it one of those files?

ageorget commented 2 years ago

I don't think so. I can't see any canceled or active mover/store requests. When I restarted the pool, no broken files or anything else appeared :

08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data/pool-cta-tape-li404a/pool/data;meta=/data/pool-cta-tape-li404a/pool/meta]]].
08 Feb 2022 10:48:51 (pool-cta-tape-li404a) [] Pool mode changed to enabled

The pool files are OK, no file is in an error state. Since the restart, I have already had some failed flushes (as always), but nothing is stuck. Everything runs well:

08 Feb 2022 10:48:51 (pool-cta-tape-li404a) [] Pool mode changed to enabled
08 Feb 2022 11:06:17 (pool-cta-tape-li404a) [c3c3cb63-9888-40d7-b5d1-67bb5a01017f 0000C44963A6900C4CE28BF72997A08E2B19] HSM script failed (script reported: 1: sh: line 0: kill: (138790) - No such process;)
08 Feb 2022 11:06:17 (pool-cta-tape-li404a) [] Flush of 0000C44963A6900C4CE28BF72997A08E2B19 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (138790) - No such process;)).
08 Feb 2022 11:10:15 (pool-cta-tape-li404a) [5fb852a8-6c68-4248-b3e3-66b5622a2b0e 0000AA137E63B96D4ECA9715C006C4506C6D] HSM script failed (script reported: 1: sh: line 0: kill: (141912) - No such process;)
08 Feb 2022 11:10:15 (pool-cta-tape-li404a) [] Flush of 0000AA137E63B96D4ECA9715C006C4506C6D failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (141912) - No such process;)).
08 Feb 2022 11:32:31 (pool-cta-tape-li404a) [23af9a43-9b5a-41c2-aa86-5b2a2b719d6c 000012BB532E18634477997E90E8780C6F52] HSM script failed (script reported: 1: sh: line 0: kill: (158822) - No such process;)
08 Feb 2022 11:32:31 (pool-cta-tape-li404a) [] Flush of 000012BB532E18634477997E90E8780C6F52 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (158822) - No such process;)).
--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 0/0
    delete     0/0

[ccdcamcli10] (pool-cta-tape-li404a@ccdcacli404-pool-cta-tape-li404a-Domain) ageorget > flush ls
Class                 Active   Error  Last/min  Requests    Failed
ageorget commented 2 years ago

I downgraded some tape import pools to 6.2.17 to check whether the new QueueStat mechanism introduced in 6.2.18 is responsible for this bug [1]. I observed that only tape pools with many (>1000) flush requests in the queue are affected by this problem, currently ATLAS and CTA, which are running a lot of tape imports. This is a major issue for us, as we have to restart some stuck tape pools every day before they fill up.

[1] https://github.com/dcache/dcache/commit/d98a894414d504d3b9e4a33dff683ed85efad9b1

kofemann commented 2 years ago

Thanks for the info. I don't think that the QueueStat change affects this, but you never know... Can you produce a thread dump when a pool runs into such a situation?

ageorget commented 2 years ago

I sent you a pool log file with a thread dump here: https://github.com/dCache/dcache/issues/6426#issuecomment-1024034958

Here is a new one for an Atlas import tape pool : https://box.in2p3.fr/index.php/s/PaHpfyXekisxCqk

--- storagehandler (Nearline storage manager) ---
  Job Queues (active/queued)
    to store   2/0

I also found this error related to QueueStat, but it doesn't seem to be linked to that bug:

08 Feb 2022 04:27:42 (pool-atlas-tapebuffer-li308a) [] Unexpected failure during state change notification
java.lang.IllegalArgumentException: null
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:121)
        at org.dcache.pool.nearline.NearlineStorageHandler$QueueStat.<init>(NearlineStorageHandler.java:150)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.incCancelFromActive(NearlineStorageHandler.java:675)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.cancel(NearlineStorageHandler.java:581)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequestContainer.cancel(NearlineStorageHandler.java:794)
        at org.dcache.pool.nearline.NearlineStorageHandler.stateChanged(NearlineStorageHandler.java:464)
        at org.dcache.pool.repository.v5.StateChangeListeners.lambda$stateChanged$0(StateChangeListeners.java:60)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
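
For illustration, a minimal, hypothetical Java sketch (not dCache's QueueStat implementation) of the suspected failure mode: if the cancel path and the completion path both decrement the active counter for the same request, the counter goes negative and a non-negativity check, like the checkArgument call in the trace above, throws:

import java.util.concurrent.atomic.AtomicInteger;

class QueueStatSketch {
    private final AtomicInteger active = new AtomicInteger();

    void requestStarted() { active.incrementAndGet(); }

    // Called when a request leaves the active set. Double accounting for the
    // same request (e.g. cancelled from a state-change notification and then
    // also completed normally) drives the counter below zero.
    void requestFinished() {
        int remaining = active.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalArgumentException("Negative number of active requests");
        }
    }

    public static void main(String[] args) {
        QueueStatSketch stats = new QueueStatSketch();
        stats.requestStarted();
        stats.requestFinished(); // normal completion
        stats.requestFinished(); // duplicate decrement for the same request: throws
    }
}
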
ageorget commented 2 years ago

Well, I didn't see any flush problem with the CTA pools downgraded to 6.2.17, and their tape migration activity is still substantial (71 TB and 15k files migrated in the last 24h).

kofemann commented 2 years ago

Thanks, good to know. It is not quite clear to me how this change affects the flushing, but yes, it might be a race or a missing notification. :man_shrugging:

kofemann commented 2 years ago

@ageorget can we do an experiment: start a single flush process, then cancel it with st kill, and check with st ls that it is gone and that the process is gone as well.

ageorget commented 2 years ago

It looks good :

[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st ls
dfa028f5-b645-4964-815c-6967f37a59cf ACTIVE Mon Feb 14 14:52:06 CET 2022 Mon Feb 14 14:52:06 CET 2022 000073A538D5D9404798988D6D5D9DD26575 hpssdata:cms
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st kill 000073A538D5D9404798988D6D5D9DD26575
Kill initialized
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st ls
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > flush ls
Class                 Active   Error  Last/min  Requests    Failed
hpssdata:cms@osm           0       1         0         1         0

14 Feb 2022 14:49:06 [flush] [] Flushing SCI=hpssdata:cms@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
14 Feb 2022 14:49:20 [pool-32-thread-1053] [] Flushed 0000859002D31B2A4D1A9B0B09AC161CF143 to nearline storage: osm://osm/?store=cchpssdcache:/hpss3/dcache/cms/hpssdata/2022/02/0000859002D31B2A4D1A9B0B09AC161CF143&group=cmsgrid&bfid=
14 Feb 2022 14:52:06 [flush] [] Flushing SCI=hpssdata:cms@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
14 Feb 2022 14:52:10 [pool-32-thread-1054] [dfa028f5-b645-4964-815c-6967f37a59cf 000073A538D5D9404798988D6D5D9DD26575] HSM script failed (script reported: 1: )
14 Feb 2022 14:52:10 [pool-32-thread-1054] [] Flush of 000073A538D5D9404798988D6D5D9DD26575 failed with: CacheException(rc=10006;msg=Flush was cancelled.).

And there is no flush process left.

ageorget commented 2 years ago

Hi, any news about this issue? The WLCG Tape Challenge will happen in the middle of March, and I'm thinking of downgrading all dCache tape import pools to 6.2.17 to avoid any trouble. The CTA pools downgraded to 6.2.17 last month never had any flush issue, contrary to the other pools on 6.2.35.

kofemann commented 2 years ago

Hi @ageorget ,

we are working on this. Unfortunately, all attempts to reproduce it in a lab environment have failed so far.

kofemann commented 2 years ago

Hi @ageorget ,

Probably you have already mentioned this, but I can't find it. When flushes are stuck, what is the result of the st ls command and of the pool's info command? Does the number of entries returned by the list match the number shown as "to store"? And in which state are the stores?

ageorget commented 2 years ago

Hi @kofemann, when the flush is stuck the st ls command returns an empty result, and the info command (like flush ls) shows inconsistent values for "to store" (the hsm has c:puts=1, but the storagehandler reports more than 1 active store request):

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   3/0
    from store 0/0
    delete     0/0
kofemann commented 2 years ago

Thanks. Looks like I got a reproducer. 🤞

kofemann commented 2 years ago

@ageorget Not a 100% fix for the issues at your site, but a probable partial fix is ready. Are you able to run one pool with a patched RPM?

ageorget commented 2 years ago

@kofemann Yes sure.

kofemann commented 2 years ago

@ageorget great! Can you give this a try: https://sas.desy.de/index.php/s/9QjHskp7xSrEsNb

ageorget commented 2 years ago

I just deployed it on 2 ATLAS/CMS tape import pools. I'll keep you informed. Thanks for the patch!

kofemann commented 2 years ago

@ageorget Any update? Do flushes still get stuck?

ageorget commented 2 years ago

Hi @kofemann, I haven't seen any problem with the pools running your patch, so I think it fixed that bug.

kofemann commented 2 years ago

Hi @ageorget ,

Today's (15.03.2022) series of bug-fix releases includes the fix. I am closing this issue. Feel free to reopen it if you observe the problem again (hopefully not).

ageorget commented 2 years ago

Thanks a lot for the fix, I'm going to upgrade all tape flush pools to this latest version. Cheers

ageorget commented 2 years ago

Hi @kofemann

Well, I'm afraid this is not completely solved... :( It happened again this morning on an ATLAS flush pool upgraded to 6.2.42.


Same symptoms: flush ls and info report 1 active store job, but there is no active flush process:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   1/0
    from store 0/0

[ccdcamcli06] (pool-atlas-tapebuffer-li419a@ccdcatli419-pool-atlas-tapebuffer-li419a-Domain) ageorget > flush ls
Class                 Active   Error  Last/min  Requests    Failed
mctape:atlas@osm           1       3       542      1602         0

The last flushes look OK:

30/03/22 10:28:57 INFO     51996 ['put', '0000A8FDC60E892F484A8254B4A77FADF135', '/data/pool-atlas-tapebuffer-li419a/pool/data/0000A8FDC60E892F484A8254B4A77FADF135', '-si=size=9742656955;new=true;stored=false;sClass=datatape:atlas;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/in2p3.fr/data/atlas/atlasdatatape/data16_13TeV/AOD/r13286_p4910/data16_13TeV.00309440.physics_Main.merge.AOD.r13286_p4910_tid28101879_00/AOD.28101879._001265.pool.root.1;uid=3327;StorageGroup=cchpssdcache:0;writeToken=104;gid=124;StoreName=datatape;SpaceTokenDescription=ATLASDATATAPE;links=000016541AD069D44D3C82873581B174DB87 AOD.28101879._001265.pool.root.1;SpaceToken=104;LinkGroupId=10;flag-c=1:f0275fea;store=datatape;group=atlas;bfid=<Unknown>;', '-pnfs=/pnfs/in2p3.fr', '-hsmfs=/hpss5/dcache', '-remove=/opt/software/rfio-hpss/prod/bin/rfrm', '-method=treqs', '-c:remove=1', '-copy=/opt/software/rfio-hpss/prod/bin/rfcp']
30/03/22 10:28:57 INFO     51996 sudo -u atlagrid /opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li419a/pool/data/0000A8FDC60E892F484A8254B4A77FADF135 cchpssdcache:/hpss5/dcache/atlas/datatape/2022/03/0000A8FDC60E892F484A8254B4A77FADF135
30/03/22 10:29:12 INFO     51996 9742656955 bytes in 15 seconds through local (in) and p1p2 (out) (634287 KB/sec)

30/03/22 10:30:45 INFO     52249 ['put', '0000D6D60B33D9534F0CA34727022CD6266D', '/data/pool-atlas-tapebuffer-li419a/pool/data/0000D6D60B33D9534F0CA34727022CD6266D', '-si=size=9992233574;new=true;stored=false;sClass=datatape:atlas;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/in2p3.fr/data/atlas/atlasdatatape/data16_13TeV/AOD/r13286_p4910/data16_13TeV.00309440.physics_Main.merge.AOD.r13286_p4910_tid28101879_00/AOD.28101879._001044.pool.root.1;uid=3327;StorageGroup=cchpssdcache:0;writeToken=104;gid=124;StoreName=datatape;SpaceTokenDescription=ATLASDATATAPE;links=000072434D487E7D4AA78F692E6D8E553D6E AOD.28101879._001044.pool.root.1;SpaceToken=104;LinkGroupId=10;flag-c=1:82857d3e;store=datatape;group=atlas;bfid=<Unknown>;', '-pnfs=/pnfs/in2p3.fr', '-hsmfs=/hpss5/dcache', '-remove=/opt/software/rfio-hpss/prod/bin/rfrm', '-method=treqs', '-c:remove=1', '-copy=/opt/software/rfio-hpss/prod/bin/rfcp']
30/03/22 10:30:45 INFO     52249 sudo -u atlagrid /opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li419a/pool/data/0000D6D60B33D9534F0CA34727022CD6266D cchpssdcache:/hpss5/dcache/atlas/datatape/2022/03/0000D6D60B33D9534F0CA34727022CD6266D
30/03/22 10:30:59 INFO     52249 9992233574 bytes in 14 seconds through local (in) and p1p2 (out) (697002 KB/sec)

30/03/22 10:31:00 INFO     52278 ['put', '00004939B544FD0A49A2918B47CE62AF1D34', '/data/pool-atlas-tapebuffer-li419a/pool/data/00004939B544FD0A49A2918B47CE62AF1D34', '-si=size=9870634478;new=true;stored=false;sClass=datatape:atlas;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/in2p3.fr/data/atlas/atlasdatatape/data16_13TeV/AOD/r13286_p4910/data16_13TeV.00309440.physics_Main.merge.AOD.r13286_p4910_tid28101879_00/AOD.28101879._001342.pool.root.1;uid=3327;StorageGroup=cchpssdcache:0;writeToken=104;gid=124;StoreName=datatape;SpaceTokenDescription=ATLASDATATAPE;links=0000F6B9091B4F614D9DA1CB5FC1DECBB0DE AOD.28101879._001342.pool.root.1;SpaceToken=104;LinkGroupId=10;flag-c=1:93fee672;store=datatape;group=atlas;bfid=<Unknown>;', '-pnfs=/pnfs/in2p3.fr', '-hsmfs=/hpss5/dcache', '-remove=/opt/software/rfio-hpss/prod/bin/rfrm', '-method=treqs', '-c:remove=1', '-copy=/opt/software/rfio-hpss/prod/bin/rfcp']
30/03/22 10:31:00 INFO     52278 sudo -u atlagrid /opt/software/rfio-hpss/prod/bin/rfcp /data/pool-atlas-tapebuffer-li419a/pool/data/00004939B544FD0A49A2918B47CE62AF1D34 cchpssdcache:/hpss5/dcache/atlas/datatape/2022/03/00004939B544FD0A49A2918B47CE62AF1D34
30/03/22 10:31:14 INFO     52278 9870634478 bytes in 14 seconds through local (in) and p1p2 (out) (688520 KB/sec)

I only see some flushes failing because ATLAS sometimes deletes a file before it is flushed to tape:

30 Mar 2022 10:36:45 (pool-atlas-tapebuffer-li419a) [] Flush of 0000C3842C1E514E461A9CCA0853D4691D78 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.).

Here is the pool log, including a thread dump: https://mycore.core-cloud.net/index.php/s/Ew728CrL8sfXCgn

Adrien

ageorget commented 2 years ago

This issue also affects staging queues: the queue count shows active "from store" jobs, but nothing is reported by rh ls or mover ls and no staging process is alive:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 4/0
    delete     0/0

[ccdcamcli06] (pool-atlas-read-li352a@ccdcacli352-pool-atlas-read-li352a-Domain) ageorget > mover ls
[ccdcamcli06] (pool-atlas-read-li352a@ccdcacli352-pool-atlas-read-li352a-Domain) ageorget > rh ls
[ccdcamcli06] (pool-atlas-read-li352a@ccdcacli352-pool-atlas-read-li352a-Domain) ageorget > 

This is less problematic because the pools are configured with hundreds of active staging movers, but still.

Adrien

ageorget commented 2 years ago

Hi, any updates on this bug? We still get stuck flush pools for ATLAS and CMS, and we need to restart them every week. The Job Queues counter also remains inconsistent for restores and staging, for example:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 13/0
    delete     0/0

[ccdcamcli06] (pool-atlas-read-li423a@ccdcacli423-pool-atlas-read-li423a-Domain) ageorget > rh ls
[ccdcamcli06] (pool-atlas-read-li423a@ccdcacli423-pool-atlas-read-li423a-Domain) ageorget > 


Active restores are shown without any active restore process alive (screenshot).

kofemann commented 2 years ago

Hi @ageorget ,

Unfortunately, no updates yet. We are aware of the issue (and observe it locally as well), but haven't found the cause of it (yet).

ageorget commented 1 year ago

Hi @kofemann

Any chance of an update on this issue, which still affects us?

On our 7.2.17 dCache instance, I found a new error message when a pool got stuck flushing: Flush of 0000xxx failed with: java.lang.IllegalArgumentException: Negative number of active requests.

Nov 03 16:16:14 ccdcacli364 dcache@ccdcacli364-pool-xenon-tape-li364a-Domain[133549]: 03 Nov 2022 16:16:14 (pool-xenon-tape-li364a) [door:gridftp-ccdcacli392-AAXsklrmmuA@gridftp-ccdcacli392Domain gridftp-ccdcacli392-AAXsklrmmuA PoolAcceptFile 0000A64DB43A5B3F49BD94B95412F61
Nov 03 16:16:14 ccdcacli364 dcache@ccdcacli364-pool-xenon-tape-li364a-Domain[133549]: 03 Nov 2022 16:16:14 (pool-xenon-tape-li364a) [door:gridftp-ccdcacli392-AAXsklrmmuA@gridftp-ccdcacli392Domain gridftp-ccdcacli392-AAXsklrmmuA PoolAcceptFile 0000A64DB43A5B3F49BD94B95412F61
Nov 03 16:16:28 ccdcacli364 dcache@ccdcacli364-pool-xenon-tape-li364a-Domain[133549]: 03 Nov 2022 16:16:28 (pool-xenon-tape-li364a) [] Flush of 00008A8DEB7D2F8F46619C5DE60A0363F5E6 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.
Nov 03 16:16:29 ccdcacli364 sudo[9252]:     root : TTY=unknown ; PWD=/usr/share/dcache ; USER=xenongdr ; COMMAND=/opt/software/rfio-hpss/prod/bin/rfcp /data/pool-xenon-tape-li364a/pool/data/0000DB511E11B17E49C08D15A86CC2BAF4DB cchpssdcache:/hpss/dcache/xenon/tape/2022/11/00
Nov 03 16:16:30 ccdcacli364 dcache@ccdcacli364-pool-xenon-tape-li364a-Domain[133549]: 03 Nov 2022 16:16:30 (pool-xenon-tape-li364a) [] Flush of 0000DB511E11B17E49C08D15A86CC2BAF4DB failed with: java.lang.IllegalArgumentException: Negative number of active requests.

The flush failed for this file on the dCache side even though the file had been successfully migrated to the HSM by the HSM driver:

03/11/22 16:16:28 INFO     9248 ['put', '0000DB511E11B17E49C08D15A86CC2BAF4DB', '/data/pool-xenon-tape-li364a/pool/data/0000DB511E11B17E49C08D15A86CC2BAF4DB', '-si=size=265514396;new=true;stored=false;sClass=tape:xenon;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/in2p3.fr/data/xenon.biggrid.nl/rucio/xnt_048436/35/3e/raw_records-rfzvpzj4mf-000289;uid=42680;StorageGroup=cchpssdcache:0;gid=1086;StoreName=tape;links=000057C25A4FA6ED499AA07FFE97BFBFE920 raw_records-rfzvpzj4mf-000289;flag-c=1:27e5e56f;store=tape;group=xenon;bfid=<Unknown>;', '-pnfs=/pnfs/in2p3.fr', '-hsmfs=/hpss/dcache', '-remove=/opt/software/rfio-hpss/prod/bin/rfrm', '-method=treqs', '-c:remove=1', '-copy=/opt/software/rfio-hpss/prod/bin/rfcp']
03/11/22 16:16:28 INFO     9248 sudo -u xenongdr /opt/software/rfio-hpss/prod/bin/rfcp /data/pool-xenon-tape-li364a/pool/data/0000DB511E11B17E49C08D15A86CC2BAF4DB cchpssdcache:/hpss/dcache/xenon/tape/2022/11/0000DB511E11B17E49C08D15A86CC2BAF4DB
03/11/22 16:16:30 INFO     9248 265514396 bytes in 1 seconds through local (in) and p1p1 (out) (259291 KB/sec)

Job Queues inconsistency:

  Job Queues (active/queued)
    to store   2/0
    from store 0/0
    delete     0/0

> flush ls
Class                 Active   Error  Last/min  Requests    Failed
tape:xenon@osm             1       1      1070      5046         0

Adrien

kofemann commented 1 year ago

HI @ageorget

My apologies for the slow response. We will try to find and fix the issue as fast as possible.

kofemann commented 1 year ago

Hi @ageorget. The good news is that I can reproduce it 🚀. Hopefully I can fix it as well.

ageorget commented 1 year ago

Hi @kofemann, very good news, I'm looking forward to testing the fix!

kofemann commented 1 year ago

Hi @ageorget I hope we have the fix.

kofemann commented 1 year ago

Hi @ageorget ,

We believe that the latest release, 7.2.27, fixes the issue.

https://www.dcache.org/old/downloads/1.9/release-notes-7.2.shtml#release7.2.27

ageorget commented 1 year ago

Hi @kofemann

Thanks a lot, I just upgraded 3 problematic tape pools to the 7.2.27 release to test it 🤞

ageorget commented 1 year ago

Hi @kofemann

Well, I can confirm the issue is fixed! The storagehandler values for the "to store" Job Queues stay consistent, and there have been no more stuck flushes since the upgrade to 7.2.27. Thanks again for fixing that!

Cheers

kofemann commented 1 year ago

Hi @ageorget ,

I am really pleased to hear that. The problem was not obvious and, unfortunately, took longer than expected to fix. Thanks for your patience.