dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

Phantom queued restore (v6.2.32) #6248

Open onnozweers opened 2 years ago

onnozweers commented 2 years ago

Dear dCache devs,

Since our upgrade from 6.0 to 6.2.32, we have a pool that says it has one queued restore:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 604800 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 0/1
    delete     0/0

But the list of restores shown by rh ls is empty.

The queued restore is also listed in dCacheView:

[screenshot from dCacheView showing the single queued restore]

I restarted the pool, but this strange situation persists. Occasionally I see an active restore and after a while it disappears, but not this single queued one.
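
For reference, this is roughly how the cached counters can be compared with the live restore list (a minimal sketch; dcache-admin-command is our local wrapper around the admin shell, shown further below, and it is assumed here to accept a pool cell name the same way it accepts PnfsManager):

# Cached summary, i.e. the "Job Queues (active/queued)" table above
dcache-admin-command eel9_atlastape 'info' | grep -A 4 'Job Queues'

# Live scan of the restore queue -- empty here, although the counter says 0/1
dcache-admin-command eel9_atlastape 'rh ls'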

Here is the HSM definition; it is the same as on the other pools in the pool group.

osm(osm):script
   command             /opt/d-cache/jobs/dmfcp.sh
   pnfs                /pnfs/ftpBase
   hsmBase             /cxfs/TIER1SC/tape_storage
   c:puts              0
   c:gets              900
   c:removes           50

I've never seen this before, and only one pool behaves like this. What could be going on?

Kind regards, Onno

paulmillar commented 2 years ago

In order to support pools with very many concurrent requests, the numbers you see in the info command and on the admin web interface are cached. This summary data (the number of active and queued requests) is updated when something changes in the pool, but without scanning the list of stage activity. In contrast, the rh ls command goes through the list of stage requests: the command's output is not cached.

Therefore, I imagine the inconsistency comes from the cached summary data being out-of-sync with the requests.

That said, I currently have no idea how this could happen and (in particular) why the inconsistency is still present after restarting the pool.

Could you check if any problems are reported in the pool log file?
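
A minimal sketch of such a check, assuming the default packaging layout where each domain logs to /var/log/dcache/<domain>.log and that the pool runs in eel9_atlastapeDomain (as the billing records quoted below suggest):

# Look for nearline/stage-related errors logged around the time the counter got stuck.
# The log path and domain name are assumptions based on the default dCache layout.
grep -iE 'restore|stage|nearline' /var/log/dcache/eel9_atlastapeDomain.log | tail -n 50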

onnozweers commented 2 years ago

Here's (part of) the pool log, starting with the pool restart:

11 Nov 2021 10:11:18 (eel9_atlastape) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
11 Nov 2021 10:11:23 (System) [] Still waiting for eel9_atlastape to shut down.
11 Nov 2021 10:11:26 (System) [] Killed eel9_atlastape in 9086 ms
11 Nov 2021 10:11:33 (System) [] Invalid config event received: {server.2=feszoo2:2888:3888:participant, server.1=feszoo1:2888:3888:participant, server.5=feszoo5:2888:3888:participant, server.4=feszoo4:2888:3888:participant, server.3=feszoo3:2888:3888:participant, version=0}
11 Nov 2021 10:11:33 (System) [] Invalid config event received: {server.2=feszoo2:2888:3888:participant, server.1=feszoo1:2888:3888:participant, server.5=feszoo5:2888:3888:participant, server.4=feszoo4:2888:3888:participant, server.3=feszoo3:2888:3888:participant, version=0}
11 Nov 2021 10:11:38 (eel9_atlastape) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
11 Nov 2021 10:11:38 (eel9_atlastape) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
11 Nov 2021 10:11:38 (eel9_atlastape) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/space/atlas/tape/pool/data;meta=/space/atlas/tape/pool/meta]]].
11 Nov 2021 10:11:40 (eel9_atlastape) [] Pool mode changed to enabled
11 Nov 2021 10:44:17 (eel9_atlastape) [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain GFTP-hake7-AAXQgC-e2SA PoolAcceptFile 0000B59ED861D7604787A0CB05660162940A] Incomplete transfer; details follow:
    Cause: java.lang.InterruptedException
    Direction: UPLOAD
    Mode: E (Extended)
        Connection role: PASSIVE
        Listening on: [2001:610:108:203a::1:59]:24851
        Closed connections: 0
        EOF flag: not received
    Transferred: 0 B
    Mover started: 2021-11-11 10:44:17.570 (109 ms ago)
    Last received data: never
11 Nov 2021 10:44:17 (eel9_atlastape) [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain GFTP-hake7-AAXQgC-e2SA PoolAcceptFile 0000B59ED861D7604787A0CB05660162940A] Transfer forcefully killed: Active transfer cancelled: killed by door: Aborting transfer due to session termination
11 Nov 2021 10:44:17 (eel9_atlastape) [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain GFTP-hake7-AAXQgC-e2SA PoolAcceptFile 0000B59ED861D7604787A0CB05660162940A] Transfer failed in post-processing: File size mismatch (expected=5159052972, actual=0)
11 Nov 2021 10:44:17 (eel9_atlastape) [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain GFTP-hake7-AAXQgC-e2SA PoolAcceptFile 0000B59ED861D7604787A0CB05660162940A] Failed to read file size: java.nio.file.NoSuchFileException: /space/atlas/tape/pool/data/0000B59ED861D7604787A0CB056601:
paulmillar commented 2 years ago

Ah, that's interesting.

It looks like there's a client attempting to read a file that is (seemingly) not currently available on the pool.

The file appears to be /space/atlas/tape/pool/data/0000B59ED861D7604787A0CB056601 with PNFS-ID 0000B59ED861D7604787A0CB05660162940A.

Could you check whether or not that file is present on the pool (rep ls), and whether pnfsmanager shows that file on the pool (cacheinfoof)?
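
For completeness, the two suggested checks spelled out (pool name and PNFS-ID taken from the log above; dcache-admin-command is the site wrapper used in the next comment, and passing the PNFS-ID to rep ls to restrict the listing is an assumption):

PNFSID=0000B59ED861D7604787A0CB05660162940A

# Is there a replica of the file on the pool?
dcache-admin-command eel9_atlastape "rep ls $PNFSID"

# Does the namespace think the file has a replica on any pool?
dcache-admin-command PnfsManager "cacheinfoof $PNFSID"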

onnozweers commented 2 years ago

No trace of the file:

[root@dcmain ~]# dcache-admin-command PnfsManager 'cacheinfoof 0000B59ED861D7604787A0CB05660162940A'
cacheinfoof failed: path [0000B59ED861D7604787A0CB05660162940A] does not exist

[root@dcmain ~]# dcache-admin-command PnfsManager 'pathfinder 0000B59ED861D7604787A0CB05660162940A'
(3) CacheException(rc=10011;msg=path [0000B59ED861D7604787A0CB05660162940A] does not exist) from pathfinder

From the billing files:

[root@dcmain ~]# grep '0000B59ED861D7604787A0CB05660162940A' /var/lib/dcache/billing/2021/11/billing-2021.11.*
/var/lib/dcache/billing/2021/11/billing-2021.11.11:11.11 10:44:17 [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management":36487:31306:2001:1458:301:cd:0:0:100:1e3] [0000B59ED861D7604787A0CB05660162940A,5159052972] [/pnfs/grid.sara.nl/data/atlas/atlasmctape/mc16_13TeV/AOD/e7954_e7400_s3126_r10724_r10726/mc16_13TeV.830011.H7EG_jetjet_JZ1.merge.AOD.e7954_e7400_s3126_r10724_r10726_tid21264878_00/AOD.21264878._002664.pool.root.1_1636623824] atlas:tape@osm 424 0 {451:"Aborting transfer due to session termination"}
/var/lib/dcache/billing/2021/11/billing-2021.11.11:11.11 10:44:17 [pool:eel9_atlastape:transfer] [0000B59ED861D7604787A0CB05660162940A,0] [/pnfs/grid.sara.nl/data/atlas/atlasmctape/mc16_13TeV/AOD/e7954_e7400_s3126_r10724_r10726/mc16_13TeV.830011.H7EG_jetjet_JZ1.merge.AOD.e7954_e7400_s3126_r10724_r10726_tid21264878_00/AOD.21264878._002664.pool.root.1_1636623824] atlas:tape@osm 0 0 true {GFtp-2.0 145.100.33.89 37909} [door:GFTP-hake7-AAXQgC-e2SA@ftp-hake7Domain:1636623857300000] {666:"Transfer forcefully killed: Active transfer cancelled: killed by door: Aborting transfer due to session termination"}
/var/lib/dcache/billing/2021/11/billing-2021.11.11:11.11 10:44:17 [pool:eel9_atlastape@eel9_atlastapeDomain:remove] [0000B59ED861D7604787A0CB05660162940A,0] [Unknown] atlas:tape@osm {0:"Transfer failed and replica is empty"}
paulmillar commented 2 years ago

Thanks for checking, Onno.

Sorry, my mistake. The client was trying to write the file (not read it). The redirect didn't work (for whatever reason), so the client aborted the upload.

It could be that this "rogue queued stage request" problem is triggered by such client interactions, or it could be just a red herring.

Just on the off-chance: is there anything special about the pool showing these symptoms (maybe in the pool setup file)?
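
One way to rule out configuration drift would be to diff the affected pool's setup file against a healthy pool in the same group; a rough sketch, in which the host eel8 and its setup path are hypothetical placeholders:

# Compare the affected pool's setup with a known-good pool in the same pool group.
# eel8 and the second path are placeholders; adjust to the actual pool layout.
diff <(ssh eel9 cat /space/atlas/stage/pool/setup) \
     <(ssh eel8 cat /space/atlas/stage/pool/setup)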

onnozweers commented 2 years ago

The setup file looks normal. I compared it with 3 other pools; nothing special. Here is the setup file:

[root@eel9 /var/log]# cat /space/atlas/stage/pool/setup
#
# Created by eel9_atlasstage(Pool) at Thu Mar 19 13:06:41 CET 2020
#
csm set checksumtype adler32
csm set policy -scrub=off
csm set policy -onread=off -onwrite=on -onflush=off -onrestore=on -enforcecrc=on -getcrcfromhsm=off
#
# Flushing Thread setup
#
flush set max active 1000
flush set interval 60
flush set retry delay 60
flush set queue order FIFO
#
# Nearline storage
#
hsm create osm osm script -command=/opt/d-cache/jobs/dmfcp.sh -pnfs=/pnfs/ftpBase -hsmBase=/cxfs/TIER1SC/tape_storage -c:puts=0 -c:gets=900 -c:removes=50
mover queue create regular -order=LIFO
mover queue create p2p -order=LIFO
mover queue create gridftp_queue -order=LIFO
mover set max active -queue=regular 100
mover set max active -queue=p2p 10
mover set max active -queue=gridftp_queue 24
jtm set timeout -queue=regular -lastAccess=72000 -total=604800
jtm set timeout -queue=gridftp_queue -lastAccess=3600 -total=14400
#
# MigrationModule
#
set heartbeat 30
set report remove on
set breakeven 0.7
set mover cost factor 0.5
set duplicate request none
#
#  Pool to Pool (P2P)
#
set max diskspace 8169816M
set gap 10G
rh set timeout 604800
st set timeout 14400
rm set timeout 14400
sweeper reclaim margin 0.0

I also compared the info output from this pool with a few other pools, and it looks very similar.
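
Roughly like this (a sketch; eel8_atlastape is a placeholder for another pool in the group, and the wrapper is assumed to accept pool cell names):

# Diff the info output of the affected pool against another pool in the group.
diff <(dcache-admin-command eel9_atlastape 'info') \
     <(dcache-admin-command eel8_atlastape 'info')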

elenamplanas commented 2 years ago

We're running dCache 6.2.33 and we see a similar behaviour.

In our case the "phantom" job is a store, and it shows as active. The value has been stuck for more than a month.

We see the same behaviour on different pools, and there are also phantom active restores.

Is there a way to force a reload of this information?

elenamplanas commented 4 months ago

Hi, we're still facing this problem, now on 9.2.8. The behaviour appears after high load on stores and/or restores.