cfgamboa opened this issue 2 months ago
Hi Carlos.
Could you please provide more information on these jobs via migration info?
Is there anything logged on the origin pools or in the PoolManager? Were these other pools ever tried?
Thanks. Lea
Hello Lea,
migration info 179
Command : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write
State : SLEEPING
Queued : 0
Attempts : 2929
Targets : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
Completed : 2821 files; 4344642122070 bytes; 100%
Total : 4344642122070 bytes
Concurrency: 40
Running tasks:
Most recent errors:
08:26:06 [4655] 0000FF579F67CAAF40D7926FBE1A57B40250: File does not exist, skipped
08:26:16 [4660] 00009A193BE526A244ECB444F4A210EC56A1: Transfer to [dc269_10@local] failed (No such file or directory: 00009A193BE526A244ECB444F4A210EC56A1); will not be retried
Carlos
@lemora here is an example where the selection goes to one pool:
Command : migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE\-write
State : RUNNING
Queued : 0
Attempts : 1731
Targets : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
Completed : 1675 files; 7823832367778 bytes; 98%
Total : 7968129878941 bytes
Concurrency: 40
Running tasks:
[16785] 000038B8C629722342969DA89EFF9978416D: TASK.Copying -> [dc258_10@local]
[16899] 0000AAC5B024EE984B5B8C9C748D0384C90C: TASK.Copying -> [dc258_10@local]
[16928] 0000550312C67B0F499BA575AE34B0E82E03: TASK.Copying -> [dc258_10@local]
[16937] 000091015BAE1D12491BB97546EE57906F20: TASK.Copying -> [dc258_10@local]
[16994] 0000330AEAE002D942B3BCBA526AEDCF96D5: TASK.Copying -> [dc258_10@local]
[17031] 0000476DD01AE9554AD9A9F1338A983C7F8A: TASK.Copying -> [dc258_10@local]
[17351] 0000276485F3CE9349648672BCC6E65684BA: TASK.Copying -> [dc258_10@local]
[17447] 0000F6747705F7CA4946BF641F828ED7007F: TASK.Copying -> [dc258_10@local]
[17459] 0000132D43F0281D45D0B9481DDFC2F1D790: TASK.Copying -> [dc258_10@local]
[17472] 0000156D3D980FB744CB85AF804115C5BD8E: TASK.Copying -> [dc258_10@local]
[17651] 00005430A7E0A45F479DA1C7E0E3C4F80338: TASK.Copying -> [dc258_10@local]
[17930] 00005EE96E2A319644B6B0152F19A9DD8790: TASK.Copying -> [dc258_10@local]
[18300] 0000AE20E0EDEC8D4EC08538C148ED24A892: TASK.Copying -> [dc258_10@local]
[18617] 00001390D214813F44449AFCFD9D9B855EDC: TASK.Copying -> [dc253_10@local]
[18752] 00003BD558A54ADA430C81FBE2AAB170042B: TASK.Copying -> [dc258_10@local]
[18764] 000011532DE9D5FA468E8516C38055CB6DD5: TASK.Copying -> [dc258_10@local]
[18993] 00002D3A5F89BF85471BA65B6873D4C9B8C5: TASK.Copying -> [dc258_10@local]
[19047] 00007415AE95AD1A4F649E83CDD9BD6FB8F7: TASK.Copying -> [dc258_10@local]
[19125] 000029FC4B8489D2476FAD4EB078DE636875: TASK.Copying -> [dc258_10@local]
[19171] 00002E6EF98D3FEA45B7B2ECC957866F22AA: TASK.Copying -> [dc258_10@local]
[19257] 0000BCBBB24C9AD147AFB847CE43AA2E7327: TASK.Copying -> [dc253_10@local]
[19293] 0000588897B48404491E9D2289658255D90C: TASK.Copying -> [dc253_10@local]
[19329] 0000ECB0152EEDF34A4C9FB38E0DC5CDFF24: TASK.Copying -> [dc253_10@local]
[19336] 0000C263D49A3AC74CC4B6F37E12A99F9F8D: TASK.Copying -> [dc258_10@local]
Many migration job tasks select the same destination pool.
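One way to quantify this from the "Running tasks" listing above is to count tasks per destination pool (a minimal shell sketch; it assumes the migration info output has been saved to a file with the hypothetical name migration-info.txt):

# Count running tasks per destination pool; the target is the second
# bracketed field of each "TASK.Copying" line.
grep 'TASK.Copying' migration-info.txt | awk -F'[][]' '{print $4}' | sort | uniq -c | sort -rn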
Only after I cancelled the ongoing stuck migration (hot pool) and excluded the hot pool from the migration job destinations did the destination pools for the transfers become more diverse:
Command : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -exclude=dc258_10 -target=pgroup -- MCTAPE\-write
State : RUNNING
Queued : 380
Attempts : 103
Targets : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10
Completed : 63 files; 108070950853 bytes; 6%
Total : 1766407491660 bytes
Concurrency: 40
Running tasks:
[20366] 0000FF68EBD77BA84E39B94455CC0B90DF0A: TASK.Copying -> [dc254_10@local]
[20368] 00003467F90D4D524A488AC8EC789E18C780: TASK.Copying -> [dc246_10@local]
[20370] 0000580E1B24CF104896A7C8F5D03DDA3CDA: TASK.Copying -> [dc254_10@local]
[20373] 00009ECC758E07F840E182D55601B52023AA: TASK.Copying -> [dc266_10@local]
[20374] 000009980CDD21E840D39B7AE2DD21A4F49C: TASK.Copying -> [dc254_10@local]
[20375] 0000D64A5282E0AD499D90A3036B3D685FFD: TASK.Copying -> [dc249_10@local]
[20379] 0000352929A645D2463C8518E351980B498B: TASK.Copying -> [dc249_10@local]
[20381] 00005FC09C36DB52460092C2F963589CC22E: TASK.Copying -> [dc253_10@local]
[20393] 0000D5BF282B5C864C729F927122C279F551: TASK.Copying -> [dc264_10@local]
[20399] 0000D276AB59EE5C4036B249AAFBD503EE0C: TASK.Copying -> [dc253_10@local]
[20404] 00003DECF626706E41AF861BFF261AB69EAC: TASK.Copying -> [dc259_10@local]
[20407] 00007417BDEE533C4E449DF48EA5C64F3469: TASK.Copying -> [dc249_10@local]
[20411] 0000FAC0B45CF886425EAE11BE8F64672F69: TASK.Copying -> [dc254_10@local]
[20413] 0000626112F23C7A4362B9F184C907A70C6E: TASK.Copying -> [dc254_10@local]
[20426] 0000F9B07024EA5F4765883F4D9BCECE51C2: TASK.Copying -> [dc255_10@local]
[20428] 0000E91CF3FEA9104B96AD086714F246EA23: TASK.Copying -> [dc254_10@local]
[20435] 000042579D86E46B4E0290D34A723BA4AC46: TASK.Copying -> [dc265_10@local]
[20436] 00007E30937DF40F432BBE6B3164C2AEACFF: TASK.Copying -> [dc254_10@local]
[20437] 00005BCEE90C79DD46359F2A7AD05398585D: TASK.Copying -> [dc268_10@local]
[20438] 00006E2857CB031042788240A9C7B45F85DB: TASK.Copying -> [dc245_10@local]
[20449] 0000ABF2340E29364D09831942BF148445C5: TASK.Copying -> [dc254_10@local]
[20453] 000067DE01BD61BC487CA29E91AA53E4958C: TASK.Copying -> [dc266_10@local]
[20454] 000000D23598EEA2447996B47C0660E30B26: TASK.Copying -> [dc253_10@local]
[20456] 0000091F04719AFC4984A1DA08753086629B: TASK.Copying -> [dc264_10@local]
[20458] 0000CBAF5B648EEC4EA481366A8B87543CEF: TASK.Copying -> [dc254_10@local]
[20460] 0000A9E9C70532074B918B264289E5039DAF: TASK.Copying -> [dc254_10@local]
[20461] 00004E9436A3151142C6B2C4F3F31CA2DB1B: TASK.Copying -> [dc253_10@local]
[20463] 00000D635C6F2DBA488EAB74383AD976E361: TASK.Copying -> [dc252_10@local]
[20464] 00009DC49ADFF6724B56BF306C26F117E626: TASK.Copying -> [dc264_10@local]
[20466] 0000F74DCE54C5A74217B31AC75F109B0E61: TASK.Copying -> [dc263_10@local]
[20467] 0000893BA234596A4575B837A0ADDD4A45E7: TASK.Copying -> [dc270_10@local]
[20468] 0000083A69109D5D4DA6BB5AF9B06F2C3CCA: TASK.Copying -> [dc268_10@local]
[20471] 000094EBE42BE8174FD7B2079B967032FC06: TASK.Copying -> [dc261_10@local]
[20472] 0000F1745EF967904B318C9595AF24BD6527: TASK.Copying -> [dc260_10@local]
[20476] 0000267757BADBCB4BC9AACD99196F606619: TASK.Copying -> [dc248_10@local]
[20477] 000018D1DEC8F9A1472F8A93B70F7C3B8C70: TASK.Copying -> [dc245_10@local]
[20478] 0000B8D1CC202EF74E1EA28E45760D8A72A4: TASK.Copying -> [dc245_10@local]
[20491] 0000B22DD9631AF14FADB85A59AD701F9A9D: TASK.Copying -> [dc245_10@local]
[20492] 00006A94B6B915564D6390226D8987B7F95E: TASK.Copying -> [dc267_10@local]
[20493] 00002FF8EE11B9634689997D73AE2FAABFF5: TASK.Copying -> [dc246_10@local]
Is it possible that the migration was still going on, but you only see stuck tasks in the output?
-kofemann /* caffeinated mutations of the core personality */
The migration from the source does not stop. The problem is that it keeps choosing the same destination pool; it does not look like a purely random process.
@cfgamboa can you check in billing and confirm that all p2p transfers went into one pool while all the others got less traffic? Or whether, on average, the data distribution is flat?
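A rough way to do that check against the billing files (a sketch only; the file path and record layout below are assumptions about a default plain-text billing setup, and this counts all transfer records on each pool rather than p2p transfers specifically, so it may need local filtering):

# Count transfer records per pool for one day of billing data.
# Path and date are placeholders; the pool name is assumed to appear in the
# leading [pool:<name>:transfer] tag of each transfer line.
grep -h ':transfer]' /var/lib/dcache/billing/billing-2024.04.18 | awk -F'[][]' '{print $2}' | sort | uniq -c | sort -rn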
Hi
It seems that disabling the random flag helps to spread the load across the pool group.
Carlos
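For reference, a minimal sketch of the same job without the random selection flag (the storage class and pool group are taken from the commands above; with -select omitted, the migration module falls back to its default selection, which, as noted in the comment below, takes pool load and space into account):

migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -replicas=1 -target=pgroup -- MCTAPE\-write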
This is the best indication that there is a load pattern that sculpts the initially random distribution. Do you have other activities on the destination pools? That may sculpt the initially random distribution, whereas not specifying random takes pool load (and space) into account.
(An example of sculpting: a slow pool will appear to "attract" many transfers when pools are selected randomly.)
Yes, there are other activities at the destination pools. On the DMZ pools there are also other migration jobs to other pool groups.
Dear all,
As reported today in the Tier1 dev meeting, our DMZ pools use migration move jobs to distribute files to the TAPE and DISK ONLY pool groups. There are 16 DMZ pools, all enabled/configured in a similar way. The following is an example of the migration job used to move files from DMZ pools to TAPE-like pools in a pool group:
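(Reproduced from the MCTAPE job quoted earlier in this thread; the other DMZ pools run similar commands.)

migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write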
Attached is a picture of the pool monitor; it corresponds to a period in which the DMZ pools are saturated (many TAPE files waiting to be moved to the internal TAPE pool groups).
It is not clear why only a few pools are chosen as destinations by the migration jobs.
This situation was first observed when we used the default setting of the -select parameter. I was expecting a more even distribution of destination pools within the TAPE pool group.
Could you please advise?
All the best, Carlos