dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

Understanding info cannot find StateChangeSet for path #7588

Closed cfgamboa closed 4 months ago

cfgamboa commented 4 months ago

Hello,

This is an extract of messages logged by the info cell:

he/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.links.lnk-DATATAPE-stage
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.links.lnk-MCTAPE-read
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.links.lnk-MCTAPE-stage
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.read.bnlt1d0:TESTTAPE@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.read.bnlt1d0:BNLT1D0@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.read.MCTAPE:MC@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.read.t1d0ccrc08:T1D0CCRC08@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.stage.bnlt1d0:TESTTAPE@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.stage.bnlt1d0:BNLT1D0@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.stage.MCTAPE:MC@osm
/var/log/dcache/dccore02Domain.log:28 May 2024 11:24:14 (info) [] cannot find StateChangeSet for path nas.lnk-DATATAPE-read,lnk-DATATAPE-stage,lnk-MCTAPE-read,lnk-MCTAPE-stage.units.store.stage.t1d0ccrc08:T1D0CCRC08@osm

In the same time range, we observed many restore requests for the same file:


05.28 09:35:31 [pool:dc261_12@dc261_12-Domain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 9939503 0 {0:""}
05.28 10:56:04 [pool:dc224_38@dc224thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 6032132 0 {0:""}
05.28 10:56:12 [pool:dc270_12@dc270_12-Domain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 5094861 0 {0:""}
05.28 10:56:13 [pool:dc235_1@dc235oneDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 5081671 0 {0:""}
05.28 11:08:04 [pool:dc241_12@dc241_12-Domain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 4332974 0 {0:""}
05.28 11:08:10 [pool:dc230_38@dc230thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 5095809 0 {0:""}
05.28 11:08:41 [pool:dc213_38@dc213thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 15171865 0 {0:""}
05.28 11:09:07 [pool:dc246_12@dc246twelveDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 4307407 0 {0:""}
05.28 11:10:34 [pool:dc234_38@dc234thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 9287221 0 {0:""}
05.28 11:10:40 [pool:dc260_12@dc260_12-Domain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 4185799 0 {0:""}
05.28 11:11:34 [pool:dc223_38@dc223thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 14100315 0 {0:""}
05.28 11:11:35 [pool:dc245_12@dc245twelveDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 4163021 0 {0:""}
05.28 11:11:37 [pool:dc228_38@dc228thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 10077319 0 {0:""}
05.28 11:11:53 [pool:dc265_12@dc265_12-Domain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 4102600 0 {0:""}
05.28 11:12:11 [pool:dc221_38@dc221thirtyeightDomain:restore] [0000BC92056D042A40589F4ED115D607DE74,8418268782] [Unknown] bnlt1d0:BNLT1D0@osm 13140075 0 {0:""}
05.28 13:21:33 [pool:dc230_38:transfer] [0000BC92056D042A40589F4ED115D607DE74,8418268782] \

This may lead to multiple restore requests for the file.

[dccore02] (local) admin > \sn cacheinfoof 0000BC92056D042A40589F4ED115D607DE74
 dc221_38 dc223_38 dc224_38 dc228_38 dc230_38 dc234_38 dc235_1 dc241_12 dc245_12 dc246_12 dc260_12 dc261_12 dc265_12 dc270_12 

No pool restarts happened around that time.

dCache version is 9.2.17

Could you please advise?

All the best, Carlos

DmitryLitvintsev commented 4 months ago

I do not think that these info messages have anything to do with multiple restores.

These pools:

dc221_38 dc223_38 dc224_38 dc228_38 dc230_38 dc234_38 dc235_1 dc241_12 dc245_12 dc246_12 dc260_12 dc261_12 dc265_12 dc270_12 

do they all belong to different pool groups, or are they all in the same pool group?

How were those restores triggered? By calling the TAPE REST API stage endpoint, or simply by read access to the file?

cfgamboa commented 4 months ago

They all share the same pool group. How can I query the restore history in the TAPE REST API?

DmitryLitvintsev commented 4 months ago

Another question: are these pools shared among multiple pool groups, or do they belong to a single pool group?

If you request a stage of this file via the REST API, does it trigger a restore?

History? I would use a direct SQL query on the bulk db:

select * from bulk_request br inner join request_target rt on  rt.rid = br.id where rt.pnfsid = '0000BC92056D042A40589F4ED115D607DE74' order by created_at asc;

When debugging you have to try to reproduce the issue. So:

  1. Do you trigger a restore if you just try to stage this file using the REST API while the file is obviously cached? That is, execute gfal-bringonline (see the sketch after this list).
  2. Do you trigger a restore if you execute multiple stage calls via the REST API?
  3. If not, remove that file replica (\sl 0000BC92056D042A40589F4ED115D607DE74 rep rm -force 0000BC92056D042A40589F4ED115D607DE74) and redo 1-2.
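
A minimal sketch of steps 1 and 2, assuming a WebDAV URL for the file (the host and path below are placeholders, not taken from this site):

# step 1: stage request for a file that is already cached
gfal-bringonline https://dcache.example.org:2880/pnfs/example.org/data/somefile

# step 2: repeated stage calls for the same file
gfal-bringonline https://dcache.example.org:2880/pnfs/example.org/data/somefile
gfal-bringonline https://dcache.example.org:2880/pnfs/example.org/data/somefile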

Another thing to check: do you currently have multiple restores on pools in the system?

ssh -p 2224 admin@<admin domain host> "\s dc* rh ls" | tr -d "\r" | egrep "[A-F0-9]{32}" | awk '{ print $(NF-1)}' | sort | uniq -c | sort -k1 -n

Note that "dc*" above will also match all your other well-known cells starting with "dc". I advise using names for pools that can be easily matched separately from the rest of the system. The above command will still run, but it will obviously fail on all non-pool cells matching the "dc*" pattern.
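
For reference, here is the same pipeline annotated stage by stage (the admin host is a placeholder; the rh ls field positions can be read off the listings later in this thread):

ssh -p 2224 admin@<admin domain host> "\s dc* rh ls" |
  tr -d "\r" |                     # strip carriage returns from the admin shell
  egrep "[A-F0-9]{32}" |           # keep only lines that contain a pnfsid
  awk '{ print $(NF-1) }' |        # the next-to-last rh ls field is the pnfsid
  sort | uniq -c | sort -k1 -n     # count how many restore entries each pnfsid has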

At Fermilab we use this naming scheme for pools:

rw-<host name>-N
r-<host name>-N
w-<host name>-N
v-<host name>-N
p-<host name>-N

rw, r, w, v, and p stand for read-write, read-only, write-only, volatile, and "persistent" (an unfortunate name for disk-only pools); then comes the simple host name (without the domain name), and N is the sequential number of the pool among the pools running on that node.

DmitryLitvintsev commented 4 months ago

I have a hunch as to what is happening... Do this:

  1. Remove this file from the cache, or use any other file.
  2. Do gfal-bringonline on that file. Verify that the file is being restored: \sp rc ls <pnfsid>.*
  3. While it is being restored, do another gfal-bringonline on that file. Re-run \sp rc ls <pnfsid>.*
  4. Did anything change in the output of \sp rc ls <pnfsid>.* between (2) and (3)?

Also, I think that instead of using a client tool like gfal-bringonline, you can just use the PinManager CLI; it has a pin command:

(PinManager@utilityDomain) enstore > help pin 
NAME
       pin -- pin a file to disk

SYNOPSIS
       pin <pnfsid> <seconds> 

DESCRIPTION
       Pins a file to disk for some time. A file may be pinned forever
       by specifying a lifetime of -1. Pinning a file may involve
       staging it or copying it from one pool to another. For that
       reason pinning may take awhile and the pin command may time out.
       The pin request will however stay active and progress may be
       tracked by listing the pins on the file.

(PinManager@utilityDomain) enstore > 

vingar commented 4 months ago

How were those restores triggered? By calling the TAPE REST API stage endpoint, or simply by read access to the file?

They got triggered by the TAPE REST API through FTS.

"id","pid","pnfsid","path","type","state","created_at","started_at","last_updated","error_type","error_message","rid" 18937520,1,0000BC92056D042A40589F4ED115D607DE74,/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/DAOD_RPVLL/r11969_r11784_p4072/data18_13TeV.00360373.physics_Main.merge.DAOD_RPVLL.r11969_r11784_p4072_tid21910926_00/DAOD_RPVLL.21910926._000585.pool.root.1,REGULAR,COMPLETED,2024-05-28 11:35:31,2024-05-28 11:35:31,2024-05-28 15:12:12,,,1124724

is the corresponding entry in the request_target table for the first example.

Afterwards, the DAV access to the file from FTS is registered in the door billing info, once FTS detects the file as staged after polling the TAPE REST API with the staging request id.

DmitryLitvintsev commented 4 months ago

Thanks. Please follow up on the actions I suggested; that will help elucidate the actual problem. Also run this query against the bulk db:

select br.started_at, br.last_modified, br.status  from bulk_request br inner join request_target rt on  rt.rid = br.id where rt.pnfsid = '0000F216ED971C9D4F31BF1592C591AD2A5D' order by br.started_at asc;

vingar commented 4 months ago

Thanks. Please follow up on the actions I suggested; that will help elucidate the actual problem. Also run this query against the bulk db:

select br.started_at, br.last_modified, br.status  from bulk_request br inner join request_target rt on  rt.rid = br.id where rt.pnfsid = '0000F216ED971C9D4F31BF1592C591AD2A5D' order by br.started_at asc;

Strange, it gives nothing. Might the entry have been deleted?

The id 18937520 is not in the bulk_request table.

DmitryLitvintsev commented 4 months ago

OK, this is not a problem. Try the other actions I suggested.

In essence: try to pin something while it is already PINNING (and staging) and see if the

\sp rc ls <pnfsid>.*

output changes. You can likely do something like:

\sp rc ls

Grab any pnfsid from the list (one that is "staging...", of course). Make a note of which pool it is staging on and the timestamp.

Check that this pnfsid is in the PINNING state:

\s PinManager ls <pnfsid>

and if yes, do:

\s PinManager pin <pnfsid> 3600

then do:

\sp rc ls <pnfsid>.*

Did the pool name (and timestamp) change?
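
A rough sketch tying these steps together over the admin ssh interface shown earlier (the admin host is a placeholder, as above; the pnfsid is taken from the test later in this thread):

ADMIN='ssh -p 2224 admin@<admin domain host>'
PNFSID=000080AD05F2A8A0450687B8C6886FBA5E05

$ADMIN "\sp rc ls ${PNFSID}.*"             # note the pool name and the timestamp
$ADMIN "\s PinManager ls ${PNFSID}"        # confirm the request is in PINNING state
$ADMIN "\s PinManager pin ${PNFSID} 3600"  # pin again while the stage is in flight
$ADMIN "\sp rc ls ${PNFSID}.*"             # did the pool name or timestamp change?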

cfgamboa commented 4 months ago

Hello

[dccore03] (dc260_12@dc260_12-Domain) admin > rep rm 000080AD05F2A8A0450687B8C6886FBA5E05
Removed 000080AD05F2A8A0450687B8C6886FBA5E05
[dccore03] (dc260_12@dc260_12-Domain) admin > \sn cacheinfoof 000080AD05F2A8A0450687B8C6886FBA5E05

Pinning the file a first time:


[dccore03] (PinManager@dccore03Domain) admin > pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600

[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain:
    000080AD05F2A8A0450687B8C6886FBA5E05@internal-net-external-net-world-net-*/* m=1 r=0 [dc223_38] [Waiting for stage: dc223_38 05.30 16:11:51] {0,}
        DCap-3.0,0:0:0:0:0:0:0:1:0

Pinning the file a second time:

[dccore03] (local) admin > \c PinManager
[dccore03] (PinManager@dccore03Domain) admin >  pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600

[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain:
    000080AD05F2A8A0450687B8C6886FBA5E05@internal-net-external-net-world-net-*/* m=2 r=0 [dc223_38] [Waiting for stage: dc223_38 05.30 16:11:51] {0,}
        DCap-3.0,0:0:0:0:0:0:0:1:0
        DCap-3.0,0:0:0:0:0:0:0:1:0

No, the time did not change.

cfgamboa commented 4 months ago

Issuing the pin on a different PinManager for the third time:

[dccore03] (PinManager@dccore02Domain) admin >  pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600
[88208005] 000080AD05F2A8A0450687B8C6886FBA5E05 pinned until Thu May 30 18:41:16 EDT 2024
[dccore03] (PinManager@dccore02Domain) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

cfgamboa commented 4 months ago

[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

DmitryLitvintsev commented 4 months ago

So, if this checks out OK, then the basic function works correctly and I am out of ideas. My general approach is to try to simplify the system: I would run with just one PoolManager and observe whether you still see multiple restores.
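
A hypothetical layout-file excerpt illustrating the idea (shown as one combined excerpt for brevity; in practice each head node has its own layout file, and the point is simply not to declare the poolmanager service in the other core domains):

# layout on the core head nodes: keep poolmanager only in dccore01Domain
[dccore01Domain]
[dccore01Domain/poolmanager]

# on the other heads, the poolmanager service is not declared
[dccore02Domain]
[dccore03Domain]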

vingar commented 4 months ago

Thanks. Please follow up on the actions I suggested; that will help elucidate the actual problem. Also run this query against the bulk db:

select br.started_at, br.last_modified, br.status  from bulk_request br inner join request_target rt on  rt.rid = br.id where rt.pnfsid = '0000F216ED971C9D4F31BF1592C591AD2A5D' order by br.started_at asc;

Strange, it gives nothing. Might the entry have been deleted?

The id 18937520 is not in the bulk_request table.

For completeness, I checked with the staging request ID recorded in FTS: 48ddd203-56d3-46c6-9aaf-2b96a37562b0. The corresponding entry in bulk_request is:

"id","uid","owner","arrived_at","started_at","last_modified","status","activity","expand_directories","url_prefix","target_prefix","clear_on_success","clear_on_failure","cancel_on_failure","prestore"
1124724,48ddd203-56d3-46c6-9aaf-2b96a37562b0,6435:31152,2024-05-28 05:35:31,2024-05-28 05:35:31,2024-05-28 09:12:12,COMPLETED,STAGE,NONE,https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage,/,false,false,false,

The id (1124724) differs from the id (18937520) recorded in the request_target table; it corresponds to the rid field instead.
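
Given that mapping, rerunning the history query keyed on the pnfsid from the original report (the join rt.rid = br.id is the correct key) should find the COMPLETED request above; a sketch:

select br.started_at, br.last_modified, br.status
from bulk_request br
inner join request_target rt on rt.rid = br.id
where rt.pnfsid = '0000BC92056D042A40589F4ED115D607DE74'
order by br.started_at asc;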

cfgamboa commented 4 months ago

Here is another example of files with multiple copies:

PoolManager

    000096C85BC551A24D5DBBAA78314DDDD81F@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
    000094762517A93145268312509A1F3B7347@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
    0000AAABBC9D44464F82913C4AB86238938B@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
    0000EC59A6144C1F4BE0B45B15062B044657@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
    00005A235E80B00C4E9C9B3F170230223A0F@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
    00008432D99F859E44ADBBE6B0F423F054E8@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > rh ls
f827ce09-1460-4acc-a667-ed96000e9594 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000EC59A6144C1F4BE0B45B15062B044657 bnlt1d0:BNLT1D0
8b69b270-445a-43bc-9aa0-2c62ff1edc72 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000094762517A93145268312509A1F3B7347 bnlt1d0:BNLT1D0
1e04cef0-4316-43c8-a88c-f59ef171d6c5 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00005A235E80B00C4E9C9B3F170230223A0F bnlt1d0:BNLT1D0
aaae2973-9872-4f38-a89f-fbc38aee363b ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000096C85BC551A24D5DBBAA78314DDDD81F bnlt1d0:BNLT1D0
7c467813-26ba-431b-92be-8f2c323dc399 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00008432D99F859E44ADBBE6B0F423F054E8 bnlt1d0:BNLT1D0
ef3c83f2-a3d0-4719-80d4-34432fca5dd4 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000AAABBC9D44464F82913C4AB86238938B bnlt1d0:BNLT1D0

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000AAABBC9D44464F82913C4AB86238938B
 dc223_38 dc224_38 dc226_38 dc230_38 dc234_38 dc245_12 dc246_12 dc260_12 dc265_12 dc266_12 dc269_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 000096C85BC551A24D5DBBAA78314DDDD81F
 dc223_38 dc224_38 dc226_38 dc241_12 dc245_12 dc260_12 dc265_12 dc266_12 dc270_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 000094762517A93145268312509A1F3B7347
 dc223_38 dc235_1 dc245_12 dc268_12 dc269_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000AAABBC9D44464F82913C4AB86238938B
 dc223_38 dc224_38 dc226_38 dc230_38 dc234_38 dc245_12 dc246_12 dc260_12 dc265_12 dc266_12 dc269_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000EC59A6144C1F4BE0B45B15062B044657
 dc223_38 dc230_38 dc245_12 dc260_12 dc265_12 dc269_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 00005A235E80B00C4E9C9B3F170230223A0F
 dc227_38 dc233_38 dc234_38 dc235_1 dc240_12 dc241_12 dc245_12 dc248_12 dc263_12 dc265_12

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 00008432D99F859E44ADBBE6B0F423F054E8
 dc223_38 dc226_38 dc245_12 dc246_12 dc261_12 dc262_12 dc263_12 dc265_12 dc269_12

cfgamboa commented 4 months ago

@DmitryLitvintsev what I just posted might be related to https://github.com/dCache/dcache/issues/7587

cfgamboa commented 4 months ago

Interestingly, setting the pool in rdonly mode:

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > rh ls
f827ce09-1460-4acc-a667-ed96000e9594 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000EC59A6144C1F4BE0B45B15062B044657 bnlt1d0:BNLT1D0
8b69b270-445a-43bc-9aa0-2c62ff1edc72 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000094762517A93145268312509A1F3B7347 bnlt1d0:BNLT1D0
1e04cef0-4316-43c8-a88c-f59ef171d6c5 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00005A235E80B00C4E9C9B3F170230223A0F bnlt1d0:BNLT1D0
aaae2973-9872-4f38-a89f-fbc38aee363b ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000096C85BC551A24D5DBBAA78314DDDD81F bnlt1d0:BNLT1D0
7c467813-26ba-431b-92be-8f2c323dc399 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00008432D99F859E44ADBBE6B0F423F054E8 bnlt1d0:BNLT1D0
ef3c83f2-a3d0-4719-80d4-34432fca5dd4 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000AAABBC9D44464F82913C4AB86238938B bnlt1d0:BNLT1D0
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 000096C85BC551A24D5DBBAA78314DDDD81F.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 000094762517A93145268312509A1F3B7347.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 0000AAABBC9D44464F82913C4AB86238938B.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 0000EC59A6144C1F4BE0B45B15062B044657.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 00005A235E80B00C4E9C9B3F170230223A0F.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 00008432D99F859E44ADBBE6B0F423F054E8.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK

DmitryLitvintsev commented 4 months ago

As I mentioned, stop your other two PoolManagers and let's run with one PoolManager for some time to see if there is a change in behavior. I take it you likely run multiple PnfsManagers as well?

DmitryLitvintsev commented 4 months ago

@DmitryLitvintsev what I just posted might be related to #7587

I am not sure. #7587 is known to occur when pools go down. You say you have not seen the pools going down, but is it possible that they went "down" temporarily and you did not notice? Are there any messages in the PoolManager log showing pools becoming disabled at random?
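
A quick way to look, assuming the log locations shown at the top of this thread (the exact message text may vary between versions, so the pattern is deliberately loose):

grep -iE "disabled|down" /var/log/dcache/dccore0*Domain.log | less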

cfgamboa commented 4 months ago

No, the pool disable -rdonly is a controlled operation when migrating pools/servers.

cfgamboa commented 4 months ago

Another, more interesting example: setting a pool to read-only mode while pinning the file.

This test verifies the behavior of a pool set to read-only mode while a file is being pinned.

  1. Pin the file
    [dccore01] (dc260_12@dc260_12-Domain) admin > rep rm 000014AE95EE37F74E2A976A165A95EC88F4 -force
    Removed 000014AE95EE37F74E2A976A165A95EC88F4
    [dccore01] (dc260_12@dc260_12-Domain) admin > \c PinManager
    PinManager   
    [dccore01] (dc260_12@dc260_12-Domain) admin > \c PinManager
    [dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
  2. Request created at the pool manager
    "\s PoolManager@dccore0*Domain rc ls "|grep 000014AE95EE37F74E2A976A165A95EC88F4
    000014AE95EE37F74E2A976A165A95EC88F4@internal-net-external-net-world-net-*/* m=1 r=0 [dc264_12] [Waiting for stage: dc264_12 05.31 11:51:45] {0,}

    The pool receives the request:

[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls
cb86e54a-479a-4ef2-85cb-487b419a6a9a ACTIVE Fri May 31 11:51:45 EDT 2024 Fri May 31 11:51:45 EDT 2024 000014AE95EE37F74E2A976A165A95EC88F4 bnlt1d0:BNLT1D0
  3. Pool in read-only mode:
[dccore01] (dc264_12@dc264_12-Domain) admin > pool disable -rdonly
Pool dc264_12 disabled(store,stage,p2p-client)
[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls
cb86e54a-479a-4ef2-85cb-487b419a6a9a ACTIVE Fri May 31 11:51:45 EDT 2024 Fri May 31 11:51:45 EDT 2024 000014AE95EE37F74E2A976A165A95EC88F4 bnlt1d0:BNLT1D0

The request is still active in the pool.

  4. No request is observed at the PoolManager for that file:
    
    [dccore01] (dc264_12@dc264_12-Domain) admin > \s PoolManager@* rc ls -l 000014AE95EE37F74E2A976A165A95EC88F4.*
    PoolManager@dccore01Domain: OK
    PoolManager@dccore02Domain: OK
    PoolManager@dccore03Domain: OK

5. At the PinManager, this is observed:

[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
(3) CacheException(rc=10006;msg=Request to [>PinManager@dccore01Domain] timed out.)


6. A subsequent attempt at pinning the file succeeds:

[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
[88216971] 000014AE95EE37F74E2A976A165A95EC88F4 pinned until Sat Jun 01 11:54:50 EDT 2024


7. The pool no longer has the staging request:

[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls