Handle 'unhealthy' orchestration volume

donaldgray commented 1 year ago

Depending on the setup, the Orchestrator and Cantaloupe instances may be sharing a NAS. If this becomes unhealthy there will be nowhere to orchestrate images to, so we need an alternative strategy for handling this.

One option would be to redirect all traffic to the 'special server' instances, as these do not need orchestration volumes; they access the origin directly.
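A minimal sketch of that fallback, assuming the volume is mounted at a local path and that "healthy" can be approximated by a write-and-read-back probe. The function names and the `"orchestrated"` / `"special-server"` labels are hypothetical, not taken from protagonist:

```python
import os
import tempfile
import uuid


def volume_is_healthy(path: str) -> bool:
    """Return True if a small probe file can be written to and read back from path."""
    token = uuid.uuid4().hex
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as probe:
            probe.write(token.encode())
            probe.flush()
            os.fsync(probe.fileno())
            probe.seek(0)
            return probe.read().decode() == token
    except OSError:
        # Missing mount point, read-only volume, I/O error, and so on.
        return False


def choose_backend(volume_path: str) -> str:
    """Pick a (hypothetical) route label based on the probe result."""
    return "orchestrated" if volume_is_healthy(volume_path) else "special-server"
```

A hung network mount can block rather than raise, so in practice a probe like this would probably need to run out-of-band with a timeout and cache its result, rather than run on every request.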

tomcrane commented 1 year ago

How does the volume recover?

donaldgray commented 1 year ago

> How does the volume recover?

It depends on what underlying technology the volume uses. Assuming it's AWS FSx for Lustre, that's a managed service, so it would become healthy again (whatever that means). At worst we would lose the orchestrated files and need to start orchestrating everything again.

A more likely issue could be that the AZ the Lustre volume is in goes down. Again, the volume would come back up once the AZ issue is resolved.

donaldgray commented 1 year ago

Note on an issue encountered with slf.

We lost an FSx for Lustre volume during slf testing. It had been running for ~9 months in the us-east-1 region. We didn't have full logging enabled, so we didn't get any warnings (assuming there were some). The only error shown in the AWS console was "Please delete your file system and create a new one", so it doesn't look like volumes recover. This was a SCRATCH_2 volume; a persistent volume may differ as it does backups.
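As a rough sketch of how a failed file system could be detected earlier, the function below filters a response shaped like the one returned by the FSx `DescribeFileSystems` API (a `FileSystems` list whose entries carry `FileSystemId` and `Lifecycle`). Treating anything other than `AVAILABLE` as worth flagging is an assumption, and the function name is hypothetical:

```python
def not_available(response: dict) -> list:
    """Return (file system id, lifecycle) pairs for entries that are not AVAILABLE."""
    return [
        (fs["FileSystemId"], fs["Lifecycle"])
        for fs in response.get("FileSystems", [])
        if fs["Lifecycle"] != "AVAILABLE"
    ]


# With boto3 installed and credentials configured, the input could come from:
#   import boto3
#   response = boto3.client("fsx").describe_file_systems()
# A hand-written sample is used here so the sketch runs without AWS access.
sample = {
    "FileSystems": [
        {"FileSystemId": "fs-aaaa", "Lifecycle": "AVAILABLE"},
        {"FileSystemId": "fs-bbbb", "Lifecycle": "FAILED"},
    ]
}
flagged = not_available(sample)
```

Running something like this on a schedule and alerting on a non-empty result would not have saved the volume, but it might have surfaced the failure before it blocked an instance from starting.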

Related: we need to revisit our cloud-init script, as this outage prevented the EC2 instance from joining the cluster. The resolution was to taint the volume and reprovision a new one, which took ~5.5 minutes.

dlcs / protagonist

Handle 'unhealthy' orchestration volume #498