Lustre to Lustre Data Movement can fail on DataIn due to `mknod()` error

bdevcich commented 1 month ago

When performing data movement during DataIn, a recursive copy into an ephemeral file system can hit a race condition. It happens intermittently, and I have no way to trigger it other than running the data movement tests repeatedly until it eventually occurs.

The dcp details are tracked in an mpifileutils issue: https://github.com/hpc/mpifileutils/issues/574

This only occurs when the ephemeral Lustre filesystem spans multiple rabbits.

Another thing to point out: this issue was not seen on internal HPE systems running TOSS 4.6.6. An upgrade to TOSS 4.7.6 happened around the same time the issue surfaced. I'm not sure whether the two are related, but perhaps something changed in ZFS/Lustre to cause this.

bdevcich commented 1 month ago

An example allocationSet for the ephemeral Lustre filesystem:

status:
  allocationSets:
  - label: ost
    storage:
      rabbit-node-1:
        allocationSize: 5368709120
      rabbit-node-2:
        allocationSize: 5368709120
  - label: mdt
    storage:
      rabbit-node-1:
        allocationSize: 0
      rabbit-node-2:
        allocationSize: 0

An external MGS is being used.
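
Since the failure has only been seen when the ephemeral filesystem spans multiple rabbits, a quick check of the allocation sets can tell whether a given workflow is exposed. Below is a minimal Python sketch, assuming the status has already been loaded into a dict (for example with a YAML parser). The helper name `spans_multiple_rabbits` is hypothetical and not part of any NNF tooling.

```python
# Hypothetical helper, not part of NNF: illustrates the
# "spans multiple rabbits" condition using the status shown above.
def spans_multiple_rabbits(status, label="ost"):
    """Return True if the allocation set with this label uses more than one rabbit."""
    for alloc_set in status.get("allocationSets", []):
        if alloc_set.get("label") == label:
            return len(alloc_set.get("storage", {})) > 1
    return False


# The example status from this comment, written as a Python dict.
example_status = {
    "allocationSets": [
        {
            "label": "ost",
            "storage": {
                "rabbit-node-1": {"allocationSize": 5368709120},
                "rabbit-node-2": {"allocationSize": 5368709120},
            },
        },
        {
            "label": "mdt",
            "storage": {
                "rabbit-node-1": {"allocationSize": 0},
                "rabbit-node-2": {"allocationSize": 0},
            },
        },
    ]
}

print(spans_multiple_rabbits(example_status))  # True: the OSTs sit on two rabbits
```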

bdevcich commented 3 weeks ago

I've added a 30-second pause after the ephemeral Lustre filesystem is mounted in the DataIn stage. This seems to have made the issue go away, which to me points to a Lustre issue. I will remove the pause and grab Lustre logs from the rabbit nodes to see if there are any breadcrumbs.
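
The shape of that workaround can be sketched as follows. This is only an illustration of "mount, wait, then copy" and not the actual data movement code: `mount_then_copy` and its `mount`, `copy`, and `sleep` parameters are hypothetical names, and the 30-second default simply mirrors the pause described above.

```python
import time


# Hypothetical sketch of the workaround: pause between the mount and the copy.
def mount_then_copy(mount, copy, settle_seconds=30.0, sleep=time.sleep):
    """Run mount(), wait settle_seconds, then run copy() and return its result."""
    mount()
    sleep(settle_seconds)  # give the freshly mounted filesystem time to settle
    return copy()


# Usage with stand-in callables; a fake sleep records the delay instead of waiting.
calls = []
result = mount_then_copy(
    mount=lambda: calls.append("mount"),
    copy=lambda: calls.append("copy") or "done",
    sleep=lambda seconds: calls.append(("sleep", seconds)),
)
print(calls, result)
```

Passing `sleep` in as a parameter keeps the delay easy to shorten or remove later, which matters here because the pause is a diagnostic aid and not a fix.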

NearNodeFlash/NearNodeFlash.github.io, issue #161