NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Sometimes failing to restart AIS in GKE: getting "lost or missing mountpath" fatal error #110

Closed colineles closed 2 years ago

colineles commented 2 years ago

Hi,

We're using AIStore in a GKE cluster using the AIS K8S operator.

We have multiple mounts specified in the spec we're currently using:

    mounts:
      - path: "/ais1"
        size: 1000Gi
      - path: "/ais2"
        size: 1000Gi

However, over time we've noticed that target pods can crash and then fail to start up. The startup log shows the following error:

    FATAL ERROR: t[rlNlzeeu]: [storage integrity error sie#50, for troubleshooting see https://github.com/NVIDIA/aistore/blob/master/docs/troubleshooting.md]: lost or missing mountpath "/ais1" ({Fs:/dev/sdb FsType:ext4 FsID:543639085,1533152499} vs {Path:/ais1 Fs:/dev/sdc FsType:ext4 FsID:543639085,1533152499 Ext:<nil> Enabled:true})

It seems that after a restart the two disks come up under different device names (/dev/sdb vs /dev/sdc), which then fails the integrity check at startup. I can work around this by removing the .ais.vmd files (it's unclear yet whether this causes data issues).

Do you have any suggestions around this issue? Is there a way to enforce which mount gets mapped to which device?
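To make the mismatch in the error message concrete, here is a minimal Python sketch that parses the two records printed in the log and compares them by filesystem type and FsID while ignoring the device name. This only illustrates the idea; it is not AIStore's implementation, and the helper names `parse_record` and `same_filesystem`, as well as the labels `observed` and `recorded`, are assumptions made for the example.

```python
import re

def parse_record(text):
    """Parse a '{Key:Value Key:Value ...}' record as printed in the error message."""
    return dict(re.findall(r"(\w+):(\S+)", text.strip("{}")))

def same_filesystem(a, b):
    """Compare two records by FsType and FsID only.

    The device name (Fs) is deliberately ignored, since it can change
    across restarts (hypothetical rule, for illustration only).
    """
    return (a["FsType"], a["FsID"]) == (b["FsType"], b["FsID"])

# The two records from the error message above. Which one is the live mount
# and which one comes from .ais.vmd is an assumption here.
observed = parse_record("{Fs:/dev/sdb FsType:ext4 FsID:543639085,1533152499}")
recorded = parse_record(
    "{Path:/ais1 Fs:/dev/sdc FsType:ext4 FsID:543639085,1533152499 Ext:<nil> Enabled:true}"
)
```

With these two records, the device names differ (`/dev/sdb` vs `/dev/sdc`) but `same_filesystem(observed, recorded)` holds, which is why matching on the device name alone looks too strict for this setup.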

alex-aizman commented 2 years ago

https://github.com/NVIDIA/aistore/commit/938193713296d7ca95857be9b17c535dc43454cf should fix it. Please try the latest master and let us know.

Separately, it'd be great to have some guidance on how to deploy on GKE. Markdown with easy steps, or something like that.

colineles commented 2 years ago

Thanks, we will test this out. Are you able to release new Docker images?

I can try to update some documentation for GKE, but in general we followed the README here https://github.com/NVIDIA/ais-k8s/tree/master/operator which seemed to work.

alex-aizman commented 2 years ago

pushed aistore/cluster-minimal and aistore/aisnode images

colineles commented 2 years ago

@alex-aizman I think the most recent aisnode image is missing curl for some reason:

Getting the following error:

    Readiness probe failed: /var/ais_config/ais_readiness.sh: line 4: /var/ais_env/env: No such file or directory
    /var/ais_config/ais_readiness.sh: line 14: curl: command not found
    /var/ais_config/ais_readiness.sh: line 21: curl: command not found
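A quick way to catch this kind of regression before pushing an image is to check that the tools the probe script calls are actually on PATH. The sketch below is a hypothetical helper (`missing_tools` is not part of AIStore or ais-k8s) and assumes Python is available wherever the check runs.

```python
import shutil

def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Example: curl is the only tool named in the probe error above.
# missing_tools(["curl"]) returns an empty list when curl is installed.
```

Running `missing_tools(["curl"])` inside the image and failing the build on a non-empty result would flag the problem before the readiness probe does.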

gaikwadabhishek commented 2 years ago

Hi @colineles, I just updated the image. Can you check now?

colineles commented 2 years ago

@gaikwadabhishek thanks, seems to be working now

ondave commented 2 years ago

FYI: it seems that a GKE PVC with an xfs filesystem does not keep the same FsID between mounts (why?), so startup fails as described here, with neither the block device nor the FsID staying consistent. So far, however, ext4 works fine.

alex-aizman commented 2 years ago

@ondave it works for us (otherwise wouldn't close) - which version are you using?

ondave commented 2 years ago

We were actually just testing on latest. A bit confused, as the reported version in the CLI is 0.93.ae32s99, whereas the GitHub tags are at 3.12? Have you tested with xfs as well as ext4? The thread above was an issue with an ext4 PVC storage class. We tried initially with xfs and had the problem of devices AND FsID not remaining stable between restarts. Once we changed to ext4, the FsIDs remained the same between restarts (devices still did not), and so 9381937 ensured that the integrity check passed.

If you are happy it is all working with xfs on GKE, then all good, and maybe we are doing something dumb. This is just intended as an FYI, not a bug report, as we haven't done enough testing to properly isolate this as an issue. And this is only something we have tried on GKE, so it may also be specific to the Google-managed PVCs.
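To help isolate whether the FsID really changes between restarts, one option is to record the filesystem ID of each mountpath before a restart and compare afterwards. The following is a minimal sketch using Python's `os.statvfs`; the function names are hypothetical, and note that `f_fsid` here is a single integer, so its representation will not match the two-part FsID that AIStore prints.

```python
import os

def snapshot_fsids(paths):
    """Map each mountpath to the filesystem ID reported by statvfs."""
    return {path: os.statvfs(path).f_fsid for path in paths}

def changed_mountpaths(before, after):
    """Return mountpaths whose filesystem ID differs between two snapshots."""
    return sorted(p for p in before if after.get(p) != before[p])
```

For example, calling `snapshot_fsids(["/ais1", "/ais2"])` inside the target pod before and after a restart (saving the first result somewhere persistent) and passing both results to `changed_mountpaths` would show which mountpaths, if any, came back with a different ID.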