NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Use a more reliable way to validate the backend disk of a mountpath #158

Closed and-1 closed 9 months ago

and-1 commented 9 months ago

I have a few data disks on each node. Sometimes, after a system reboot, a disk name changes (e.g., what was /dev/sdd becomes /dev/sdb). The mountpath is still bound to the same disk as before, but the disk has a new name. The ais target does not expect this and fails to start with the error: [storage integrity error sie#50, for troubleshooting see https://github.com/NVIDIA/aistore/blob/master/docs/troubleshooting.md]: lost or missing mountpath "/ais/sdc" ({Fs:/dev/sdb FsType:xfs FsID:2064,0 } vs {Path:/ais/sdc Fs:/dev/sdd FsType:xfs FsID:2096,0 Ext:<nil> Enabled:true})
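To illustrate why both fields in the error differ, here is a small sketch that decodes the two FsID values. It assumes that the first FsID number is a Linux device number in the kernel's standard major/minor encoding, which is an assumption on my part (it is not confirmed anywhere in this thread), and that whole SCSI disks use block major 8 with 16 minors per disk. The helper names `decode_dev` and `sd_name` are made up for this example.

```python
# Sketch only. Assumption: the first FsID number is a Linux device number
# encoded as (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12).


def decode_dev(dev):
    """Split an encoded device number into (major, minor)."""
    major = (dev >> 8) & 0xFFF
    minor = (dev & 0xFF) | ((dev >> 12) & 0xFFF00)
    return major, minor


def sd_name(major, minor):
    """Kernel name of a whole SCSI disk under block major 8 (sda..sdp)."""
    if major != 8 or minor % 16 != 0:
        raise ValueError("not a whole disk under block major 8")
    return "sd" + chr(ord("a") + minor // 16)


# The two FsID values quoted in the error message above.
for fsid in (2064, 2096):
    major, minor = decode_dev(fsid)
    print(fsid, "%d:%d" % (major, minor), sd_name(major, minor))
# prints:
# 2064 8:16 sdb
# 2096 8:48 sdd
```

Under that assumption the FsID follows the kernel device name, so it changes together with /dev/sdX and cannot serve as a stable disk identity across reboots.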

/dev/sd[a-z] is not a reliable reference to a disk. I think a more correct way to bind a disk to a mountpath is to use /dev/disk/by-id/ or /dev/disk/by-uuid/.
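A minimal sketch of the idea, not how aistore works: resolve the symlinks under /dev/disk/by-id to the device nodes they currently point to, and compare a saved stable name against whatever device backs the mountpath now. The function names `stable_ids` and `same_disk` and the `by_id_dir` parameter are hypothetical, introduced only for this example.

```python
import os


def stable_ids(by_id_dir="/dev/disk/by-id"):
    """Map each current device node to the stable names that point to it."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the current kernel device node
            mapping.setdefault(os.path.realpath(link), []).append(name)
    return mapping


def same_disk(saved_id, current_device, by_id_dir="/dev/disk/by-id"):
    """True if saved_id still refers to current_device, whatever its sdX name."""
    names = stable_ids(by_id_dir).get(os.path.realpath(current_device), [])
    return saved_id in names


if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    for device, names in sorted(stable_ids().items()):
        print(device, names)
```

With this kind of check, a disk that moves from /dev/sdd to /dev/sdb after a reboot still matches, because the saved name under /dev/disk/by-id keeps pointing to the same physical disk.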

alex-aizman commented 9 months ago

Can you provide logs? For example:

$ ais log get cluster

Please attach all of them.

and-1 commented 9 months ago

My target pods are now up and running, but I have attached the logs from the failure, collected with kubectl logs -p <pod>: ais-target.tar.gz

JFYI: I attach data to the target pods using PVs and the local static provisioner. The PVs are created by hand and look like this:

apiVersion: v1
kind: PersistentVolume
spec:
  local:
    fsType: xfs
    path: /dev/disk/by-id/wwn-0x5002538e7210943e
...

alex-aizman commented 9 months ago

daemon.go:177 Version 3.11.24dd15eb ...

We are at v3.21 right now. Version 3.11, which you are running, is not simply old, it is very old. I'd suggest upgrading to the latest release, if possible.

Secondly and separately:

Error code 50 above (sie#50) corresponds to the following check:

I don't think the error code has changed, but the code that handles it certainly has. Feel free to give it a shot.