dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

Migration move jobs random flag verification #7550

Open cfgamboa opened 2 months ago

cfgamboa commented 2 months ago

Dear all,

As reported today in the Tier1 dev meeting, our DMZ pools use migration move jobs to distribute files to TAPE and DISK ONLY pool groups. The following is an example of a migration job used to move files from DMZ pools to TAPE-like pools in a pool group.

migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE-write

migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write

There are 16 DMZ pools that are enabled/configured in a similar way.

Attached is a picture of the pool monitor; it corresponds to a period in which the DMZ pools are saturated (many TAPE files awaiting to be moved to the internal TAPE pool groups).

[image: pool monitor view during DMZ pool saturation]

It is not clear why only a few pools are chosen as destinations by the migration jobs.

This situation was first observed when we used the default setting of the -select parameter.

I was expecting a more distributed allocation of destination pools from the TAPE pool group.

Could you please advise?

All the best, Carlos

lemora commented 2 months ago

Hi Carlos.

Could you please provide more information on these jobs via migration info? Is there anything logged on the origin pools or in PoolManager? Were these other pools ever tried?

Thanks. Lea

cfgamboa commented 2 months ago

Hello Lea,

 migration info 179
Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write
State      : SLEEPING
Queued     : 0
Attempts   : 2929
Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
Completed  : 2821 files; 4344642122070 bytes; 100%
Total      : 4344642122070 bytes
Concurrency: 40
Running tasks:
Most recent errors:
08:26:06 [4655] 0000FF579F67CAAF40D7926FBE1A57B40250: File does not exist, skipped
08:26:16 [4660] 00009A193BE526A244ECB444F4A210EC56A1: Transfer to [dc269_10@local] failed (No such file or directory: 00009A193BE526A244ECB444F4A210EC56A1); will not be retried

Carlos

cfgamboa commented 2 months ago

@lemora here is an example where the selection goes to one pool:

    Command    : migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE\-write
    State      : RUNNING
    Queued     : 0
    Attempts   : 1731
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
    Completed  : 1675 files; 7823832367778 bytes; 98%
    Total      : 7968129878941 bytes
    Concurrency: 40
    Running tasks:
    [16785] 000038B8C629722342969DA89EFF9978416D: TASK.Copying -> [dc258_10@local]
    [16899] 0000AAC5B024EE984B5B8C9C748D0384C90C: TASK.Copying -> [dc258_10@local]
    [16928] 0000550312C67B0F499BA575AE34B0E82E03: TASK.Copying -> [dc258_10@local]
    [16937] 000091015BAE1D12491BB97546EE57906F20: TASK.Copying -> [dc258_10@local]
    [16994] 0000330AEAE002D942B3BCBA526AEDCF96D5: TASK.Copying -> [dc258_10@local]
    [17031] 0000476DD01AE9554AD9A9F1338A983C7F8A: TASK.Copying -> [dc258_10@local]
    [17351] 0000276485F3CE9349648672BCC6E65684BA: TASK.Copying -> [dc258_10@local]
    [17447] 0000F6747705F7CA4946BF641F828ED7007F: TASK.Copying -> [dc258_10@local]
    [17459] 0000132D43F0281D45D0B9481DDFC2F1D790: TASK.Copying -> [dc258_10@local]
    [17472] 0000156D3D980FB744CB85AF804115C5BD8E: TASK.Copying -> [dc258_10@local]
    [17651] 00005430A7E0A45F479DA1C7E0E3C4F80338: TASK.Copying -> [dc258_10@local]
    [17930] 00005EE96E2A319644B6B0152F19A9DD8790: TASK.Copying -> [dc258_10@local]
    [18300] 0000AE20E0EDEC8D4EC08538C148ED24A892: TASK.Copying -> [dc258_10@local]
    [18617] 00001390D214813F44449AFCFD9D9B855EDC: TASK.Copying -> [dc253_10@local]
    [18752] 00003BD558A54ADA430C81FBE2AAB170042B: TASK.Copying -> [dc258_10@local]
    [18764] 000011532DE9D5FA468E8516C38055CB6DD5: TASK.Copying -> [dc258_10@local]
    [18993] 00002D3A5F89BF85471BA65B6873D4C9B8C5: TASK.Copying -> [dc258_10@local]
    [19047] 00007415AE95AD1A4F649E83CDD9BD6FB8F7: TASK.Copying -> [dc258_10@local]
    [19125] 000029FC4B8489D2476FAD4EB078DE636875: TASK.Copying -> [dc258_10@local]
    [19171] 00002E6EF98D3FEA45B7B2ECC957866F22AA: TASK.Copying -> [dc258_10@local]
    [19257] 0000BCBBB24C9AD147AFB847CE43AA2E7327: TASK.Copying -> [dc253_10@local]
    [19293] 0000588897B48404491E9D2289658255D90C: TASK.Copying -> [dc253_10@local]
    [19329] 0000ECB0152EEDF34A4C9FB38E0DC5CDFF24: TASK.Copying -> [dc253_10@local]
    [19336] 0000C263D49A3AC74CC4B6F37E12A99F9F8D: TASK.Copying -> [dc258_10@local]

Many migration jobs select the same pool.

[image: many migration jobs selecting the same destination pool]
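
For what it's worth, the skew can be quantified directly from the migration info output above. A minimal sketch in Python (a hypothetical helper, not part of dCache; it only assumes the `TASK.Copying -> [pool@domain]` line format shown in these listings):

    import re
    import sys
    from collections import Counter

    # Tally destination pools from "TASK.Copying -> [pool@domain]" lines
    # pasted on stdin (e.g. copied from `migration info <job-id>`).
    counts = Counter(
        m.group(1)
        for line in sys.stdin
        for m in re.finditer(r"->\s*\[([^@\]]+)@", line)
    )
    for pool, n in counts.most_common():
        print(f"{pool:12s} {n:4d}")

Run against the task list above, it reports dc258_10 for 20 of the 24 running tasks and dc253_10 for the remaining 4, which makes the imbalance easy to track over time.
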
cfgamboa commented 2 months ago

Only when I cancelled the ongoing migration stuck on the hot pool and excluded the hot pool from the migration job's destinations did the destination pools for transfers become more diverse.

    Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -exclude=dc258_10 -target=pgroup -- MCTAPE\-write
    State      : RUNNING
    Queued     : 380
    Attempts   : 103
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10
    Completed  : 63 files; 108070950853 bytes; 6%
    Total      : 1766407491660 bytes
    Concurrency: 40
    Running tasks:
    [20366] 0000FF68EBD77BA84E39B94455CC0B90DF0A: TASK.Copying -> [dc254_10@local]
    [20368] 00003467F90D4D524A488AC8EC789E18C780: TASK.Copying -> [dc246_10@local]
    [20370] 0000580E1B24CF104896A7C8F5D03DDA3CDA: TASK.Copying -> [dc254_10@local]
    [20373] 00009ECC758E07F840E182D55601B52023AA: TASK.Copying -> [dc266_10@local]
    [20374] 000009980CDD21E840D39B7AE2DD21A4F49C: TASK.Copying -> [dc254_10@local]
    [20375] 0000D64A5282E0AD499D90A3036B3D685FFD: TASK.Copying -> [dc249_10@local]
    [20379] 0000352929A645D2463C8518E351980B498B: TASK.Copying -> [dc249_10@local]
    [20381] 00005FC09C36DB52460092C2F963589CC22E: TASK.Copying -> [dc253_10@local]
    [20393] 0000D5BF282B5C864C729F927122C279F551: TASK.Copying -> [dc264_10@local]
    [20399] 0000D276AB59EE5C4036B249AAFBD503EE0C: TASK.Copying -> [dc253_10@local]
    [20404] 00003DECF626706E41AF861BFF261AB69EAC: TASK.Copying -> [dc259_10@local]
    [20407] 00007417BDEE533C4E449DF48EA5C64F3469: TASK.Copying -> [dc249_10@local]
    [20411] 0000FAC0B45CF886425EAE11BE8F64672F69: TASK.Copying -> [dc254_10@local]
    [20413] 0000626112F23C7A4362B9F184C907A70C6E: TASK.Copying -> [dc254_10@local]
    [20426] 0000F9B07024EA5F4765883F4D9BCECE51C2: TASK.Copying -> [dc255_10@local]
    [20428] 0000E91CF3FEA9104B96AD086714F246EA23: TASK.Copying -> [dc254_10@local]
    [20435] 000042579D86E46B4E0290D34A723BA4AC46: TASK.Copying -> [dc265_10@local]
    [20436] 00007E30937DF40F432BBE6B3164C2AEACFF: TASK.Copying -> [dc254_10@local]
    [20437] 00005BCEE90C79DD46359F2A7AD05398585D: TASK.Copying -> [dc268_10@local]
    [20438] 00006E2857CB031042788240A9C7B45F85DB: TASK.Copying -> [dc245_10@local]
    [20449] 0000ABF2340E29364D09831942BF148445C5: TASK.Copying -> [dc254_10@local]
    [20453] 000067DE01BD61BC487CA29E91AA53E4958C: TASK.Copying -> [dc266_10@local]
    [20454] 000000D23598EEA2447996B47C0660E30B26: TASK.Copying -> [dc253_10@local]
    [20456] 0000091F04719AFC4984A1DA08753086629B: TASK.Copying -> [dc264_10@local]
    [20458] 0000CBAF5B648EEC4EA481366A8B87543CEF: TASK.Copying -> [dc254_10@local]
    [20460] 0000A9E9C70532074B918B264289E5039DAF: TASK.Copying -> [dc254_10@local]
    [20461] 00004E9436A3151142C6B2C4F3F31CA2DB1B: TASK.Copying -> [dc253_10@local]
    [20463] 00000D635C6F2DBA488EAB74383AD976E361: TASK.Copying -> [dc252_10@local]
    [20464] 00009DC49ADFF6724B56BF306C26F117E626: TASK.Copying -> [dc264_10@local]
    [20466] 0000F74DCE54C5A74217B31AC75F109B0E61: TASK.Copying -> [dc263_10@local]
    [20467] 0000893BA234596A4575B837A0ADDD4A45E7: TASK.Copying -> [dc270_10@local]
    [20468] 0000083A69109D5D4DA6BB5AF9B06F2C3CCA: TASK.Copying -> [dc268_10@local]
    [20471] 000094EBE42BE8174FD7B2079B967032FC06: TASK.Copying -> [dc261_10@local]
    [20472] 0000F1745EF967904B318C9595AF24BD6527: TASK.Copying -> [dc260_10@local]
    [20476] 0000267757BADBCB4BC9AACD99196F606619: TASK.Copying -> [dc248_10@local]
    [20477] 000018D1DEC8F9A1472F8A93B70F7C3B8C70: TASK.Copying -> [dc245_10@local]
    [20478] 0000B8D1CC202EF74E1EA28E45760D8A72A4: TASK.Copying -> [dc245_10@local]
    [20491] 0000B22DD9631AF14FADB85A59AD701F9A9D: TASK.Copying -> [dc245_10@local]
    [20492] 00006A94B6B915564D6390226D8987B7F95E: TASK.Copying -> [dc267_10@local]
    [20493] 00002FF8EE11B9634689997D73AE2FAABFF5: TASK.Copying -> [dc246_10@local]

kofemann commented 2 months ago

Is it possible that the migration was going on, but you see only stuck tasks in the output?

cfgamboa commented 2 months ago

The migration from the source does not stop. The problem is that it keeps choosing the same destination pool; it does not seem to be a purely random process.

kofemann commented 2 months ago

@cfgamboa can you check in billing and confirm whether all p2p transfers went into one pool while all others got less traffic, or whether on average the data distribution is flat?

cfgamboa commented 2 months ago

Hi

It seems that removing the random flag helps to spread the load across the pool group.

Carlos
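
For reference, the same job without the -select flag (so the migration module falls back to its default target selection, which, as noted in the next comment, takes pool load and space into account rather than picking uniformly at random) would be:

    migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -replicas=1 -target=pgroup -- MCTAPE\-write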

DmitryLitvintsev commented 2 months ago

This is the best indication that there is a load pattern that sculpts the initially random distribution. Do you have other activities on the destination pools? That may sculpt the initially random distribution, whereas not specifying random takes pool load (and space) into account.

(An example of sculpting: a slow pool will appear to "attract" many transfers when pools are selected randomly.)
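
This snapshot effect is easy to reproduce with a toy model. A minimal sketch (made-up pool names and service times, not dCache code): 20 equally likely targets, one of them 10x slower, and a fixed concurrency of 40 as in the jobs above.

    import heapq
    import random
    from collections import Counter

    random.seed(1)
    POOLS = [f"pool{i:02d}" for i in range(20)]
    SLOW = "pool07"                  # hypothetical slow ("hot") pool
    CONCURRENCY = 40                 # matches -concurrency=40 above

    running = []                     # min-heap of (finish_time, pool)
    now = 0.0
    for _ in range(5000):
        if len(running) >= CONCURRENCY:
            now, _ = heapq.heappop(running)       # wait for a free slot
        pool = random.choice(POOLS)               # uniform random target
        service = 10.0 if pool == SLOW else 1.0   # slow pool is 10x slower
        heapq.heappush(running, (now + service, pool))

    # Snapshot of the running tasks, like `migration info` prints:
    print(Counter(pool for _, pool in running).most_common(3))

Even though the slow pool receives only 1/20 of the assignments, it typically holds roughly a third of the 40 concurrent slots in any snapshot (about ten times its fair share), because its tasks linger while the others turn over quickly.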

cfgamboa commented 2 months ago

Yes, there are other activities at the destination pools; also, on the DMZ pools there are other migration jobs to other pool groups.