cea-hpc / robinhood

Robinhood Policy Engine : a versatile tool to monitor filesystem contents and schedule actions on filesystem entries.
http://robinhood.sf.net

Some rename changelogs do not update fileclasses #140

Open thiell opened 2 months ago

thiell commented 2 months ago

In some cases, with Lustre 2.16 and Robinhood 3.1.7 + patches from GerritHub (see our branch at https://github.com/stanford-rc/robinhood/ ), and with Lustre changelogs RENME/RNMTO enabled, a rename does not always update fileclasses. This is quite frequent when using MinIO on top of Lustre, as each uploaded file is renamed to its final destination once the upload is complete.

For example, we had a fileclass like this, which we use to exclude files from a policy:

FileClass miniosys {
    definition { tree == "/elm/*/*/*/*/minio/*/*/.minio.sys" }
}

Files are first created within .minio.sys/, so they initially get the miniosys fileclass. After the rename, however, they occasionally (quite often with MinIO) keep the miniosys fileclass:

     file,             new,   97.92 MB, minio_p-srcc, elm_p-srcc, mr+p-srcc+minio_n2+miniosys+mr_srcc_minio_n2, /elm/stanford/mr/projects/srcc/minio/n2/disk0/sherlock-groups-weekly/eewhite.tar/e5c64363-61fc-4aa0-9616-b8339a25e30e/part.16
# lfs path2fid /elm/stanford/mr/projects/srcc/minio/n2/disk0/sherlock-groups-weekly/eewhite.tar/e5c64363-61fc-4aa0-9616-b8339a25e30e/part.16
[0x280000c5c:0x1942e:0x0]
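
To make the mismatch explicit, here is a minimal Python sketch that checks a path against the miniosys tree pattern. It assumes that `*` matches exactly one path component and that a `tree` condition covers the matched directory and everything below it; this is an approximation for illustration, not Robinhood's actual matching code. The helper `in_tree` and the temporary path are hypothetical; the final path is the one reported above.

```python
from fnmatch import fnmatchcase

def in_tree(path: str, tree_pattern: str) -> bool:
    """Return True if path is the tree root or lies below it.

    Assumption: each '*' in the pattern matches a single path component.
    """
    pat_parts = tree_pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(path_parts) < len(pat_parts):
        return False
    # zip() stops after the pattern components, so deeper entries still match.
    return all(fnmatchcase(part, glob) for part, glob in zip(path_parts, pat_parts))

MINIOSYS_TREE = "/elm/*/*/*/*/minio/*/*/.minio.sys"

# Final path of the entry, as shown in the report above.
final_path = (
    "/elm/stanford/mr/projects/srcc/minio/n2/disk0/sherlock-groups-weekly/"
    "eewhite.tar/e5c64363-61fc-4aa0-9616-b8339a25e30e/part.16"
)
# Hypothetical temporary location under .minio.sys (not taken from the logs).
tmp_path = "/elm/stanford/mr/projects/srcc/minio/n2/disk0/.minio.sys/tmp/example/part.16"

print(in_tree(final_path, MINIOSYS_TREE))  # False: the entry should no longer be in miniosys
print(in_tree(tmp_path, MINIOSYS_TREE))    # True: the entry matches miniosys while under .minio.sys
```

Under these assumptions the final path falls outside the miniosys tree, so the miniosys fileclass shown in the report is stale.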

Full logs with this FID:

2024/06/20 13:24:04 [1197304/3] ChangeLog | elm-MDT0002: 59965585 01CREAT 1718915044.658417518 0x0 t=[0x280000c5c:0x1942e:0x0] p=[0x280000c5c:0x1942d:0x0] part.16
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965586 17MTIME 1718915047.297580513 0x6 t=[0x280000c5c:0x1942e:0x0]
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965587 11CLOSE 1718915047.297599588 0xc2 t=[0x280000c5c:0x1942e:0x0]
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965588 08RENME 1718915047.343062670 0x0 t=[0:0x0:0x0] p=[0x280000c5c:0x155ad:0x0] part.16 s=[0x280000c5c:0x1942e:0x0] sp=[0x280000c5c:0x1942d:0x0] part.16
2024/06/20 13:24:07 [1197304/3] ChangeLog | Rename: object=[0x280000c5c:0x1942e:0x0], old parent/name=[0x280000c5c:0x1942d:0x0]/part.16, new parent/name=[0x280000c5c:0x155ad:0x0]/part.16
2024/06/20 13:24:14 [1197304/17] EntryProc | [0x280000c5c:0x1942e:0x0]: run_all_cl_cb=none
2024/06/20 13:24:14 [1197304/17] EntryProc | RECORD: CREAT [0x280000c5c:0x1942e:0x0] 0 part.16 => getstripe=1, getattr=1, getpath=1, readlink=0, getstatus()
2024/06/20 13:24:17 [1197304/16] EntryProc | RECORD: RENME [0x280000c5c:0x1942e:0x0] 0 part.16 => getstripe=0, getattr=0, getpath=0, readlink=0, getstatus()
2024/06/20 13:24:17 [1197304/16] EntryProc | Parent dir for entry [0x280000c5c:0x1942e:0x0] is unknown (parent: [0x280000c5c:0x1942d:0x0], child name: 'part.16'): updating entry path info
2024/06/20 13:24:17 [1197304/16] EntryProc | [0x280000c5c:0x1942e:0x0]: run_all_cl_cb=none
2024/06/20 13:24:17 [1197304/16] EntryProc | RECORD: RNMTO [0x280000c5c:0x1942e:0x0] 0 part.16 => getstripe=0, getattr=1, getpath=0, readlink=0, getstatus()
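
For anyone trying to reproduce this, a small Python sketch like the following can pull every changelog record for one FID out of a Robinhood log and show the old and new parent of a rename. It only assumes the log line layout visible above; the regular expressions and helper names (`parse_record`, `object_fid`) are illustrative and not part of Robinhood.

```python
import re

# Layout assumed from the ChangeLog lines above: "<MDT>: <index> <NN><TYPE> <rest>".
RECORD_RE = re.compile(
    r"ChangeLog \| (?P<mdt>\S+): (?P<index>\d+) (?P<code>\d{2})(?P<type>[A-Z]+) (?P<rest>.*)"
)
FID_FIELD_RE = re.compile(r"(\w+)=\[([^\]]+)\]")  # t=[...], p=[...], s=[...], sp=[...]

def parse_record(line):
    """Parse one raw changelog line into a dict, or return None if it is not one."""
    m = RECORD_RE.search(line)
    if m is None:
        return None
    rec = {"mdt": m["mdt"], "index": int(m["index"]), "type": m["type"]}
    rec.update(dict(FID_FIELD_RE.findall(m["rest"])))
    return rec

def object_fid(rec):
    """For RENME the renamed object is in s= (t= is zero in the log above); otherwise use t=."""
    return rec.get("s") if rec["type"] == "RENME" else rec.get("t")

LOG = """\
2024/06/20 13:24:04 [1197304/3] ChangeLog | elm-MDT0002: 59965585 01CREAT 1718915044.658417518 0x0 t=[0x280000c5c:0x1942e:0x0] p=[0x280000c5c:0x1942d:0x0] part.16
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965586 17MTIME 1718915047.297580513 0x6 t=[0x280000c5c:0x1942e:0x0]
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965587 11CLOSE 1718915047.297599588 0xc2 t=[0x280000c5c:0x1942e:0x0]
2024/06/20 13:24:07 [1197304/3] ChangeLog | elm-MDT0002: 59965588 08RENME 1718915047.343062670 0x0 t=[0:0x0:0x0] p=[0x280000c5c:0x155ad:0x0] part.16 s=[0x280000c5c:0x1942e:0x0] sp=[0x280000c5c:0x1942d:0x0] part.16
2024/06/20 13:24:07 [1197304/3] ChangeLog | Rename: object=[0x280000c5c:0x1942e:0x0], old parent/name=[0x280000c5c:0x1942d:0x0]/part.16, new parent/name=[0x280000c5c:0x155ad:0x0]/part.16
"""

FID = "0x280000c5c:0x1942e:0x0"
records = [r for r in map(parse_record, LOG.splitlines()) if r and object_fid(r) == FID]

for r in records:
    if r["type"] == "RENME":
        print(r["index"], r["type"], "old parent", r["sp"], "-> new parent", r["p"])
    else:
        print(r["index"], r["type"])
```

On the excerpt above this lists the CREAT, MTIME, CLOSE and RENME records in order, and shows the rename moving the entry from parent 0x280000c5c:0x1942d:0x0 to parent 0x280000c5c:0x155ad:0x0.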

I have been trying to troubleshoot this issue, without success so far. Simple rename cases just work, but at scale with MinIO there seems to be a race condition that leaves the fileclasses not updated. One particularity of MinIO is that the file is created in a temporary directory under .minio.sys, and that directory is deleted right after the file is renamed to its final destination. Thus, when the rename changelog is processed, the file's original parent directory no longer exists. This could be why we see this race so often with MinIO, and perhaps why "Parent dir for entry [0x280000c5c:0x1942e:0x0] is unknown" is logged here, but I am not 100% sure.

I am opening this ticket to keep track of this issue. For now I will use a workaround: including tree != "/elm/*/*/*/*/minio/*/*/.minio.sys" in the policy condition itself instead of using a fileclass.

Note that running a full scan does fix the fileclasses.