I do not think that these info messages have anything to do with multiple restores.
These pools:
dc221_38 dc223_38 dc224_38 dc228_38 dc230_38 dc234_38 dc235_1 dc241_12 dc245_12 dc246_12 dc260_12 dc261_12 dc265_12 dc270_12
do they all belong to different pool groups or are they all the same pool group?
How did those restores get triggered? By calling the TAPE REST API stage endpoint or simply by read access to the file?
They all share the same pool group. How can I query the history of the restores in the TAPE REST API?
Another question: are these pools shared among multiple pool groups, or do they all belong to a single pool group?
If you request a stage of this file via the REST API, does it trigger a restore?
History? I would use a direct SQL query on the bulk db:
select * from bulk_request br inner join request_target rt on rt.rid = br.id where rt.pnfsid = '0000BC92056D042A40589F4ED115D607DE74' order by created_at asc;
When debugging you have to try to reproduce the issue. So:
1. gfal-bringonline the file.
2. Check that the restore gets triggered.
3. Remove the replica from the pool (\sl 0000BC92056D042A40589F4ED115D607DE74 rep rm -force 0000BC92056D042A40589F4ED115D607DE74) and redo 1-2.
Another thing to check. Do you have multiple restores on pools currently in the system:
ssh -p 2224 admin@<admin domain host> "\s dc* rh ls" | tr -d "\r" | egrep "[A-F0-9]+{32}" | awk '{ print $(NF-1)}' | sort | uniq -c | sort -k1 -n
Note that the "dc*" pattern above will also match all your other well-known cells whose names start with "dc". I advise using pool names that can be matched separately from the rest of the system. The command above will still run, but it will obviously fail on all non-pool cells matching the "dc*" pattern.
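One way around the pattern clash, sketched here under the assumptions that you have password-less ssh key access to the admin interface and that "psu ls pool" prints one pool name per line, is to take the pool list from PoolManager itself and loop over it:
for p in $(ssh -p 2224 admin@<admin domain host> "\s PoolManager psu ls pool" | tr -d "\r"); do
    # query each pool's restore handler queue individually
    ssh -p 2224 admin@<admin domain host> "\s $p rh ls" | tr -d "\r" | egrep "[A-F0-9]{32}"
done | awk '{ print $(NF-1) }' | sort | uniq -c | sort -k1 -n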
At Fermilab we use this naming scheme for pools:
rw-<host name>-N
r-<host name>-N
w-<host name>-N
v-<host name>-N
p-<host name>-N
rw, r, w, v, p stand for read-write, read-only, write-only, volatile and "persistent" (an unfortunate name for disk-only pools), followed by the simple host name (without the domain name) and N, a sequential number among the pools running on that pool node.
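With a scheme like that, the same one-liner can be scoped to a single class of pools without touching any other cells, e.g. (the pool name pattern below is purely illustrative of such a naming convention):
ssh -p 2224 admin@<admin domain host> "\s rw-* rh ls" | tr -d "\r" | egrep "[A-F0-9]{32}" | awk '{ print $(NF-1) }' | sort | uniq -c | sort -k1 -n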
I have a hunch as to what is happening. Do this:
1. gfal-bringonline on that file. Verify that this file is being restored: \sp rc ls <pnfsid>.*
2. gfal-bringonline on that file again.
3. Re-run \sp rc ls <pnfsid>.*
Did the output of \sp rc ls <pnfsid>.* change between (2) and (3)?
Also, I think, instead of using a client tool like gfal-bringonline you can just use the PinManager CLI; it has a pin command:
(PinManager@utilityDomain) enstore > help pin
NAME
pin -- pin a file to disk
SYNOPSIS
pin <pnfsid> <seconds>
DESCRIPTION
Pins a file to disk for some time. A file may be pinned forever
by specifying a lifetime of -1. Pinning a file may involve
staging it or copying it from one pool to another. For that
reason pinning may take awhile and the pin command may time out.
The pin request will however stay active and progress may be
tracked by listing the pins on the file.
(PinManager@utilityDomain) enstore >
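For example, to exercise the same staging path for the file discussed above, something along these lines should do (the 3600 s lifetime is chosen arbitrarily):
\s PinManager pin 0000BC92056D042A40589F4ED115D607DE74 3600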
How did those restores get triggered? By calling the TAPE REST API stage endpoint or simply by read access to the file?
They got triggered by the TAPE REST API through FTS.
This is the corresponding entry in the request_target table for the first example:
"id","pid","pnfsid","path","type","state","created_at","started_at","last_updated","error_type","error_message","rid"
18937520,1,0000BC92056D042A40589F4ED115D607DE74,/pnfs/usatlas.bnl.gov/BNLT1D0/data18_13TeV/DAOD_RPVLL/r11969_r11784_p4072/data18_13TeV.00360373.physics_Main.merge.DAOD_RPVLL.r11969_r11784_p4072_tid21910926_00/DAOD_RPVLL.21910926._000585.pool.root.1,REGULAR,COMPLETED,2024-05-28 11:35:31,2024-05-28 11:35:31,2024-05-28 15:12:12,,,1124724
Afterwards, once FTS detects the file as staged (by polling the TAPE REST API with the staging request id), the dav access on the file from FTS is registered in the door billing info.
Thanks. Please follow up on the actions I suggested; they will help to elucidate the actual problem. And run this query against the bulk db:
select br.started_at, br.last_modified, br.status from bulk_request br inner join request_target rt on rt.rid = br.id where rt.pnfsid = '0000F216ED971C9D4F31BF1592C591AD2A5D' order by br.started_at asc;
Strange, it gives nothing. Might the entry have been deleted?
The id 18937520 is not in the bulk_request table.
OK, this is not a problem. Try the other actions I suggested.
In essence: try to pin something while it is already PINNING (and staging) and see if the output of
\sp rc ls <pnfsid>.*
changes. You likely can do something like
\sp rc ls
and grab any pnfsid from the list (one of the ones in "staging...", of course). Make a note of what pool it is staging on and the timestamp.
Check that this pnfsid is in PINNING state:
\s PinManager ls <pnfsid>
and if yes, do:
\s PinManager pin <pnfsid> 3600
then do:
\sp rc ls <pnfsid>.*
Did the pool name (and timestamp) change?
Hello
[dccore03] (dc260_12@dc260_12-Domain) admin > rep rm 000080AD05F2A8A0450687B8C6886FBA5E05
Removed 000080AD05F2A8A0450687B8C6886FBA5E05
[dccore03] (dc260_12@dc260_12-Domain) admin > \sn cacheinfoof 000080AD05F2A8A0450687B8C6886FBA5E05
Pinning the file for the first time:
[dccore03] (PinManager@dccore03Domain) admin > pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600
[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain:
000080AD05F2A8A0450687B8C6886FBA5E05@internal-net-external-net-world-net-*/* m=1 r=0 [dc223_38] [Waiting for stage: dc223_38 05.30 16:11:51] {0,}
DCap-3.0,0:0:0:0:0:0:0:1:0
Pinning the file a second time:
[dccore03] (local) admin > \c PinManager
[dccore03] (PinManager@dccore03Domain) admin > pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600
[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain:
000080AD05F2A8A0450687B8C6886FBA5E05@internal-net-external-net-world-net-*/* m=2 r=0 [dc223_38] [Waiting for stage: dc223_38 05.30 16:11:51] {0,}
DCap-3.0,0:0:0:0:0:0:0:1:0
DCap-3.0,0:0:0:0:0:0:0:1:0
No, the time did not change.
Issuing the pin on a different PinManager for the third time:
[dccore03] (PinManager@dccore02Domain) admin > pin 000080AD05F2A8A0450687B8C6886FBA5E05 8600
[88208005] 000080AD05F2A8A0450687B8C6886FBA5E05 pinned until Thu May 30 18:41:16 EDT 2024
[dccore03] (PinManager@dccore02Domain) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore03] (local) admin > \s PoolManager@* rc ls -l 000080AD05F2A8A0450687B8C6886FBA5E05.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
So... if this checks out OK, then the basic function works correctly and I am out of ideas. My general approach is to try to simplify the system: I would run with just one PoolManager and observe whether you still get multiple restores.
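If it helps, here is a minimal sketch of that change, assuming your layout files declare the PoolManager per core domain under the names visible in the prompts above (that is an assumption about your setup): comment the service out on two of the core hosts and restart those domains, e.g. on dccore02 (and likewise on dccore03):
[dccore02Domain]
# [dccore02Domain/poolmanager]   <- commented out for the test; dccore01 keeps the only running PoolManager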
For completeness, I checked with the staging request ID recorded in FTS: 48ddd203-56d3-46c6-9aaf-2b96a37562b0. The corresponding entry in bulk_request is:
"id","uid","owner","arrived_at","started_at","last_modified","status","activity","expand_directories","url_prefix","target_prefix","clear_on_success","clear_on_failure","cancel_on_failure","prestore"
1124724,48ddd203-56d3-46c6-9aaf-2b96a37562b0,6435:31152,2024-05-28 05:35:31,2024-05-28 05:35:31,2024-05-28 09:12:12,COMPLETED,STAGE,NONE,https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage,/,false,false,false,
The id (1124724) is different from the id (18937520) recorded in the request_target table; it corresponds to the rid field instead.
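In other words, to go from the request id that FTS records to the per-file targets, a query keyed on the uid rather than the numeric ids may be less confusing (column names taken from the outputs above; verify against your schema):
select br.id, br.uid, br.status, rt.pnfsid, rt.state from bulk_request br inner join request_target rt on rt.rid = br.id where br.uid = '48ddd203-56d3-46c6-9aaf-2b96a37562b0';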
Here is another set of examples of files with multiple copies.
PoolManager rc ls output:
000096C85BC551A24D5DBBAA78314DDDD81F@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
000094762517A93145268312509A1F3B7347@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
0000AAABBC9D44464F82913C4AB86238938B@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
0000EC59A6144C1F4BE0B45B15062B044657@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
00005A235E80B00C4E9C9B3F170230223A0F@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
00008432D99F859E44ADBBE6B0F423F054E8@internal-net-external-net-world-net-*/* m=0 r=0 [dc221_38] [Waiting for stage: dc221_38 05.28 11:12:11] {0,}
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > rh ls
f827ce09-1460-4acc-a667-ed96000e9594 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000EC59A6144C1F4BE0B45B15062B044657 bnlt1d0:BNLT1D0
8b69b270-445a-43bc-9aa0-2c62ff1edc72 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000094762517A93145268312509A1F3B7347 bnlt1d0:BNLT1D0
1e04cef0-4316-43c8-a88c-f59ef171d6c5 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00005A235E80B00C4E9C9B3F170230223A0F bnlt1d0:BNLT1D0
aaae2973-9872-4f38-a89f-fbc38aee363b ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000096C85BC551A24D5DBBAA78314DDDD81F bnlt1d0:BNLT1D0
7c467813-26ba-431b-92be-8f2c323dc399 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00008432D99F859E44ADBBE6B0F423F054E8 bnlt1d0:BNLT1D0
ef3c83f2-a3d0-4719-80d4-34432fca5dd4 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000AAABBC9D44464F82913C4AB86238938B bnlt1d0:BNLT1D0
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000AAABBC9D44464F82913C4AB86238938B
dc223_38 dc224_38 dc226_38 dc230_38 dc234_38 dc245_12 dc246_12 dc260_12 dc265_12 dc266_12 dc269_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 000096C85BC551A24D5DBBAA78314DDDD81F
dc223_38 dc224_38 dc226_38 dc241_12 dc245_12 dc260_12 dc265_12 dc266_12 dc270_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 000094762517A93145268312509A1F3B7347
dc223_38 dc235_1 dc245_12 dc268_12 dc269_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000AAABBC9D44464F82913C4AB86238938B
dc223_38 dc224_38 dc226_38 dc230_38 dc234_38 dc245_12 dc246_12 dc260_12 dc265_12 dc266_12 dc269_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 0000EC59A6144C1F4BE0B45B15062B044657
dc223_38 dc230_38 dc245_12 dc260_12 dc265_12 dc269_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 00005A235E80B00C4E9C9B3F170230223A0F
dc227_38 dc233_38 dc234_38 dc235_1 dc240_12 dc241_12 dc245_12 dc248_12 dc263_12 dc265_12
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \sn cacheinfoof 00008432D99F859E44ADBBE6B0F423F054E8
dc223_38 dc226_38 dc245_12 dc246_12 dc261_12 dc262_12 dc263_12 dc265_12 dc269_12
@DmitryLitvintsev what I just posted might be related to https://github.com/dCache/dcache/issues/7587
Interestingly, after setting the pool to rdonly mode:
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > rh ls
f827ce09-1460-4acc-a667-ed96000e9594 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000EC59A6144C1F4BE0B45B15062B044657 bnlt1d0:BNLT1D0
8b69b270-445a-43bc-9aa0-2c62ff1edc72 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000094762517A93145268312509A1F3B7347 bnlt1d0:BNLT1D0
1e04cef0-4316-43c8-a88c-f59ef171d6c5 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00005A235E80B00C4E9C9B3F170230223A0F bnlt1d0:BNLT1D0
aaae2973-9872-4f38-a89f-fbc38aee363b ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 000096C85BC551A24D5DBBAA78314DDDD81F bnlt1d0:BNLT1D0
7c467813-26ba-431b-92be-8f2c323dc399 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 00008432D99F859E44ADBBE6B0F423F054E8 bnlt1d0:BNLT1D0
ef3c83f2-a3d0-4719-80d4-34432fca5dd4 ACTIVE Tue May 28 11:12:11 EDT 2024 Tue May 28 11:12:11 EDT 2024 0000AAABBC9D44464F82913C4AB86238938B bnlt1d0:BNLT1D0
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 000096C85BC551A24D5DBBAA78314DDDD81F.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 000094762517A93145268312509A1F3B7347.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 0000AAABBC9D44464F82913C4AB86238938B.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 0000EC59A6144C1F4BE0B45B15062B044657.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 00005A235E80B00C4E9C9B3F170230223A0F.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc221_38@dc221thirtyeightDomain) admin > \s PoolManager@* rc ls 00008432D99F859E44ADBBE6B0F423F054E8.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
As I mentioned: stop your other two PoolManagers and let's run with one PoolManager for some time to see if there is a change in behavior. I take it you likely run multiple PnfsManagers as well?
@DmitryLitvintsev what I just posted might be related to #7587
I am not sure. #7587 is known to occur when pools go down. You say you have not seen the pools going down, but it is possible that they went "down" temporarily and you did not notice. Are there any messages in the PoolManager log showing pools going disabled randomly?
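A quick way to check would be to grep the core domain logs for such messages, assuming the default log location /var/log/dcache/<domain>.log (both the path and the keyword are assumptions to adapt to your setup):
grep -i "disabled" /var/log/dcache/dccore0*Domain.log | tail -n 50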
No, the pool disable -rdonly is a controlled operation used when migrating pools/servers.
Another, more interesting example: setting a pool to read-only mode while pinning the file.
This test verifies the behavior of a pool set to read-only mode while a file is being pinned.
[dccore01] (dc260_12@dc260_12-Domain) admin > rep rm 000014AE95EE37F74E2A976A165A95EC88F4 -force
Removed 000014AE95EE37F74E2A976A165A95EC88F4
[dccore01] (dc260_12@dc260_12-Domain) admin > \c PinManager
[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
"\s PoolManager@dccore0*Domain rc ls "|grep 000014AE95EE37F74E2A976A165A95EC88F4
000014AE95EE37F74E2A976A165A95EC88F4@internal-net-external-net-world-net-*/* m=1 r=0 [dc264_12] [Waiting for stage: dc264_12 05.31 11:51:45] {0,}
The pool receives the request:
[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls
cb86e54a-479a-4ef2-85cb-487b419a6a9a ACTIVE Fri May 31 11:51:45 EDT 2024 Fri May 31 11:51:45 EDT 2024 000014AE95EE37F74E2A976A165A95EC88F4 bnlt1d0:BNLT1D0
[dccore01] (dc264_12@dc264_12-Domain) admin > pool disable -rdonly
Pool dc264_12 disabled(store,stage,p2p-client)
[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls
cb86e54a-479a-4ef2-85cb-487b419a6a9a ACTIVE Fri May 31 11:51:45 EDT 2024 Fri May 31 11:51:45 EDT 2024 000014AE95EE37F74E2A976A165A95EC88F4 bnlt1d0:BNLT1D0
Request is active in the pool
[dccore01] (dc264_12@dc264_12-Domain) admin > \s PoolManager@* rc ls -l 000014AE95EE37F74E2A976A165A95EC88F4.*
PoolManager@dccore01Domain: OK
PoolManager@dccore02Domain: OK
PoolManager@dccore03Domain: OK
[dccore01] (dc260_12@dc260_12-Domain) admin > \c PinManager
[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
(3) CacheException(rc=10006;msg=Request to [>PinManager@dccore01Domain] timed out.)
5. At the PinManager this is observed:
[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
(3) CacheException(rc=10006;msg=Request to [>PinManager@dccore01Domain] timed out.)
6. If there is a subsequent attempt at pinning the file:
[dccore01] (PinManager@dccore01Domain) admin > pin 000014AE95EE37F74E2A976A165A95EC88F4 86000
[88216971] 000014AE95EE37F74E2A976A165A95EC88F4 pinned until Sat Jun 01 11:54:50 EDT 2024
7. The pool does not have the staging request:
[dccore01] (dc264_12@dc264_12-Domain) admin > rh ls
Hello,
This is an extract of the messages logged at the info cell.
In the same time range we have observed that many restore requests were triggered at the same time.
This may lead to multiple restore requests for the file.
At the moment no pool restart has happened.
The dCache version is 9.2.17.
Could you please advise?
All the best, Carlos