Open donaldgray opened 1 year ago
How does the volume recover?
It depends on the underlying technology the volume uses. Assuming it's AWS FSx for Lustre, that's a managed service, so it would become healthy again (whatever that means) - at worst we would lose orchestrated files and need to start orchestrating everything again.
A more likely issue is that the AZ the Lustre volume is in goes down - again, the volume would come back up once the issue is resolved.
Note on an issue encountered with slf.
We lost an FSx for Lustre volume during slf testing. It had been running for ~9 months in the us-east-1 region. We didn't have full logging enabled, so we didn't get any warnings (assuming there were some). The only error shown in the AWS console was "Please delete your file system and create a new one", so it doesn't look like volumes recover. This was a SCRATCH_2 volume - persistent may differ as it does backups.
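To catch this sooner next time, we could poll the file system's lifecycle state. Below is a minimal sketch, assuming boto3 is available and credentials are configured. The lifecycle names (`AVAILABLE`, `FAILED`, etc.) are the ones the FSx `DescribeFileSystems` API reports; the mapping to "ok" / "replace" / "investigate" and the function names are assumptions for illustration, not AWS guidance.

```python
def classify_lifecycle(lifecycle: str) -> str:
    """Map an FSx lifecycle state to a suggested action (our assumption)."""
    if lifecycle == "AVAILABLE":
        return "ok"
    if lifecycle == "FAILED":
        # Matches what we saw: the console asked us to delete and recreate.
        return "replace"
    # CREATING, UPDATING, MISCONFIGURED, DELETING or anything unexpected.
    return "investigate"


def get_lifecycle(file_system_id: str, region: str = "us-east-1") -> str:
    """Fetch the current lifecycle state of a single file system."""
    import boto3  # imported lazily so classify_lifecycle has no dependencies

    client = boto3.client("fsx", region_name=region)
    response = client.describe_file_systems(FileSystemIds=[file_system_id])
    return response["FileSystems"][0]["Lifecycle"]
```

Something like `classify_lifecycle(get_lifecycle(fs_id))`, with `fs_id` being the real file system ID, could run on a schedule and alert on anything other than "ok".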
Related - we need to revisit our cloud-init script, as this outage prevented the EC2 instance from joining the cluster. The resolution was to taint and reprovision a new volume, which took ~5.5 minutes.
Depending on the setup, the Orchestrator and Cantaloupe instances may share a NAS. If it becomes unhealthy there will be nowhere to orchestrate images to - we need an alternative strategy for handling this.
One option would be to redirect all traffic to 'special server' instances, as these do not need orchestration volumes; they access the origin directly.
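A rough sketch of what that fallback decision could look like, assuming a simple write probe is a good enough health signal. The function names and the "cantaloupe" / "special-server" labels are hypothetical and only illustrate the idea; the actual routing would live wherever traffic is currently directed.

```python
import os
import tempfile


def volume_is_writable(path: str) -> bool:
    """Return True if a small temporary file can be written under path."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"ok")
            probe.flush()
        return True
    except OSError:
        # Missing mount point, read-only volume, permission error, etc.
        return False


def choose_backend(volume_path: str) -> str:
    """Use the normal path while the volume is usable, else fall back."""
    if volume_is_writable(volume_path):
        return "cantaloupe"
    return "special-server"
```

One caveat: I/O against a hung network mount can block rather than fail, so a real check would probably need a timeout around the probe.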