RSE-Cambridge / data-acc

Data Accelerator: Creates a burst buffer from generic hardware and integrates it with Slurm https://www.hpc.cam.ac.uk/research/data-acc http://www.stackhpc.com
https://rse-cambridge.github.io/data-acc
Apache License 2.0

Fix a race condition by checking the presence of partitions #123

Status: Closed (jsteel44 closed this 4 years ago)

jsteel44 commented 4 years ago

Wait for partitions to appear in /dev.
Make sure they are block devices before running mkfs.
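For illustration only, the two checks described above can be sketched in Python as follows. This is not the code from this PR; the function names, the timeout, and the polling interval are hypothetical and chosen just to show the idea of refusing to proceed until the path is a real block device.

```python
import os
import stat
import time


def is_block_device(path):
    """Return True only if path exists and is a block device node."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, permission problem, etc.: treat as "not ready".
        return False
    return stat.S_ISBLK(mode)


def wait_for_block_device(path, timeout=30.0, interval=0.5):
    """Poll until path is a block device or the timeout (seconds) expires.

    Returns True if the block device appeared in time, False otherwise.
    The default timeout and interval are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_block_device(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run mkfs only when `wait_for_block_device("/dev/nvme7n1p2")` returns True. Note that a regular file at that path makes the check fail rather than pass, which is the point of testing for a block device instead of mere existence.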

jsteel44 commented 4 years ago

This could fix #120, as I suspect mkfs was creating an OST as a regular file in /dev/ (before the partition appeared), which caused persistent problems with those block devices. This may also be what we were seeing in #119, if the file system was actually on these "files" (not block devices) in /dev. At least, I haven't seen these issues since applying this fix.

jsteel44 commented 4 years ago

This may need some more thought, as I just witnessed this:

brw-rw---- 1 root disk       259, 16 Jan 14 14:53 /dev/nvme7n1
-rw-r--r-- 1 root root 1559361290240 Jan 14 13:55 /dev/nvme7n1p2
# file /dev/nvme7n1p2
/dev/nvme7n1p2: Linux rev 1.0 ext4 filesystem data, UUID=49630272-b740-4797-a3be-d971c7c08651, volume name "lusuaWUt:OST0001" (extents) (large files) (huge files)

So something is still able to create an OST there (as a regular file in /dev, not on the block device) despite the "wait_for" and "test" clauses before running mkfs.lustre.
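As a diagnostic aid, a small script could look for this symptom on a node: entries in /dev that are named like NVMe partitions but are regular files. The sketch below is hypothetical and not part of data-acc; the name pattern is an assumption based on the `/dev/nvme7n1p2` example above, and it only reports what it finds without removing anything.

```python
import os
import re
import stat

# Assumed naming scheme for NVMe partitions, e.g. nvme7n1p2.
PARTITION_NAME = re.compile(r"^nvme\d+n\d+p\d+$")


def find_stray_partition_files(dev_dir="/dev"):
    """Return paths under dev_dir that look like NVMe partitions
    but are regular files rather than block devices."""
    strays = []
    for name in sorted(os.listdir(dev_dir)):
        if not PARTITION_NAME.match(name):
            continue
        path = os.path.join(dev_dir, name)
        try:
            # lstat so that a symlink is not mistaken for its target.
            mode = os.lstat(path).st_mode
        except OSError:
            # Entry vanished between listdir and lstat; skip it.
            continue
        if stat.S_ISREG(mode):
            strays.append(path)
    return strays


if __name__ == "__main__":
    for stray in find_stray_partition_files():
        print("regular file where a partition was expected:", stray)
```

Running this before and after the mkfs step might help narrow down which step creates the file, although it does not by itself explain how the file gets past the checks.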