Open bdevcich opened 1 month ago
An example allocationSet for the ephemeral lustre:
```yaml
status:
  allocationSets:
  - label: ost
    storage:
      rabbit-node-1:
        allocationSize: 5368709120
      rabbit-node-2:
        allocationSize: 5368709120
  - label: mdt
    storage:
      rabbit-node-1:
        allocationSize: 0
      rabbit-node-2:
        allocationSize: 0
```
An external MGS is being used.
I've added a 30-second pause after the mount of the ephemeral lustre in the DataIn stage, and the issue goes away. To me, that points to a Lustre issue. I will remove the pause and collect Lustre logs from the rabbit nodes to see if there are any breadcrumbs.
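As a sketch, the fixed 30-second pause could be replaced by polling the filesystem for readiness before starting the copy. Everything below (`EphemeralFS`, `settle_time`, `copy_with_readiness_poll`) is a hypothetical toy model of the symptom, not the actual rabbit/Lustre/dcp code:

```python
import time

class EphemeralFS:
    """Toy stand-in for an ephemeral Lustre mount (hypothetical)."""
    def __init__(self, settle_time):
        self.mounted_at = time.monotonic()
        # Time until all targets are actually usable after the mount returns.
        self.settle_time = settle_time

    def ready(self):
        return time.monotonic() - self.mounted_at >= self.settle_time

    def copy_in(self, files):
        # A copy started before the filesystem has settled hits the race
        # and fails, mirroring the intermittent dcp failures during DataIn.
        if not self.ready():
            raise IOError("race: filesystem not fully ready")
        return len(files)

def copy_with_readiness_poll(fs, files, timeout=5.0, interval=0.05):
    # Poll for readiness instead of sleeping a fixed 30 seconds.
    deadline = time.monotonic() + timeout
    while not fs.ready():
        if time.monotonic() > deadline:
            raise TimeoutError("filesystem never became ready")
        time.sleep(interval)
    return fs.copy_in(files)

fs = EphemeralFS(settle_time=0.2)
try:
    fs.copy_in(["a", "b"])  # immediate copy: hits the simulated race
except IOError as e:
    print("immediate copy failed:", e)
print("copied", copy_with_readiness_poll(fs, ["a", "b"]), "files")
```

A fixed pause only masks the window; polling (or an explicit readiness signal from the mount path) removes it regardless of how long the filesystem takes to settle.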
When performing data movement during DataIn, a recursive copy into an ephemeral file system can run into a race condition. This happens intermittently; I have no reliable reproducer other than running data movement tests until it eventually hits.
The dcp details are tracked in an mpifileutils issue: https://github.com/hpc/mpifileutils/issues/574
This only occurs when the ephemeral lustre filesystem spans multiple rabbits.
Another thing to point out is that this issue was not seen on internal HPE systems running TOSS 4.6.6. A recent upgrade to TOSS 4.7.6 happened around the same time this issue surfaced. I'm not sure whether it's related, but perhaps something changed in ZFS/Lustre to cause this issue.