Quansight / open-gpu-server

The Open GPU Server for CI purpose.

[Incident] Server down due to disk full and accidental docker volume deletion #19

Closed aktech closed 9 months ago

aktech commented 9 months ago

The GPU Server's performance was degraded this morning because the disk ran out of space. This left MySQL and RabbitMQ unresponsive, which in turn caused degraded or no responses from most OpenStack APIs.
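To catch a full disk before it takes services down, a small check like the one below could be run periodically. This is only a minimal sketch and not what is deployed on the server: the function names, the monitored path, and the 90% threshold are all assumptions made for illustration.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path="/", threshold=0.9):
    """Return True when usage is at or above `threshold` (hypothetical 90% default)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    # The path is an assumption; point it at the mount that holds the Docker volumes.
    path = "/"
    fraction = disk_usage_fraction(path)
    status = "WARNING: nearly full" if is_disk_nearly_full(path) else "ok"
    print(f"{path}: {fraction:.1%} used ({status})")
```

Running it from cron or a systemd timer and alerting on the warning would give some lead time before MySQL or RabbitMQ run out of space.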

While trying to fix this, I accidentally deleted the DB Docker volume. As a result, the following important configurations were lost:

aktech commented 9 months ago

I have restored the following so far:

aktech commented 9 months ago

The server is back online, and the networking configurations have been restored as well.

aktech commented 9 months ago

Runners are back in operation:

Screenshot 2023-12-01 at 8 02 07 pm

Closing this issue.

jakirkham commented 9 months ago

Thanks Amit! 🙏