biosimulations / deployment

Kubernetes Configuration for BioSimulations
MIT License

Investigate lockup of HSDS #69

Closed jonrkarr closed 2 years ago

jonrkarr commented 2 years ago

In the past few runs of ModelDB and BiGG, the HSDS appears to have locked up: Kubernetes reports the HSDS pods as healthy, but the HSDS fails to process GET or POST requests.

As suggested in biosimulations/biosimulations#4340, a more relevant way of evaluating the health of the HSDS (and restarting it) might be helpful. Perhaps this needs to try uploading and downloading a test HDF5 file. If this fails, the service should be restarted.
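A minimal sketch of what such a round-trip check could look like, assuming the actual HSDS upload and download calls are supplied by the caller. The names `round_trip_ok`, `exit_code`, `upload`, and `download` are hypothetical and not part of HSDS or this repo; the payload here is plain bytes standing in for a small test HDF5 file.

```python
import os
import time
from typing import Callable


def round_trip_ok(
    upload: Callable[[bytes], None],   # hypothetical: writes the test payload to the HSDS
    download: Callable[[], bytes],     # hypothetical: reads the test payload back
    payload_size: int = 1024,
    max_seconds: float = 10.0,
) -> bool:
    """Return True only if a fresh payload survives a write/read round trip in time."""
    payload = os.urandom(payload_size)  # fresh bytes, so a stale copy cannot pass
    start = time.monotonic()
    try:
        upload(payload)
        echoed = download()
    except Exception:
        # Any client error (refused connection, 5xx raised by the client, ...) is a failure.
        return False
    # This only measures elapsed time; it cannot interrupt a hung call.
    # The probe runner's own timeout is assumed to handle that case.
    return echoed == payload and time.monotonic() - start <= max_seconds


def exit_code(upload: Callable[[bytes], None], download: Callable[[], bytes]) -> int:
    """Map the check to a process exit code: 0 for healthy, 1 for unhealthy."""
    return 0 if round_trip_ok(upload, download) else 1
```

A script built on this could be run as an exec-style liveness probe, since Kubernetes treats a nonzero exit code from an exec probe as a failure and restarts the container after the configured failure threshold. Whether a restart is safe at that moment is a separate question.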

This seems to be the most critical issue with the dispatch service. There seem to be more issues, but we'll have to leave those to the future to free up bandwidth for Physiome, BioModels, and UI improvements.

bilalshaikh42 commented 2 years ago

The service seems to be returning 503s, meaning it is overloaded. Restarting it at that time could result in data loss since the cache is dirty. It might be that removing the results cache is what is causing the extra load. We can try to increase the number of replicas.
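If 503 means "overloaded, try later" rather than "dead", one client-side option is to back off and retry instead of restarting the service. A rough sketch, assuming the request is wrapped in a caller-supplied function that returns the HTTP status code. `call_with_backoff` and `send` are hypothetical names, and the delay values are placeholders, not tuned for the HSDS.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],           # hypothetical: performs the request, returns the status code
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on 503 with exponential backoff; return the last status code seen."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break  # success or a non-overload error: retrying will not help
        # Delay doubles each retry (0.5, 1, 2, 4, ...), capped at max_delay_s.
        sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))
        status = send()
    return status
```

This only reduces pressure from a single client. It does not address the underlying load, so more replicas or fewer concurrent writers may still be needed.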

bilalshaikh42 commented 2 years ago

image

When running multiple simulations, the service gets stuck and then recovers after some time. This is probably due to the max pending writes being exceeded. I'll look into what could be causing this. Maybe we need to move the bucket to the same region.

bilalshaikh42 commented 2 years ago

Debugging in https://github.com/HDFGroup/hsds/issues/130

bilalshaikh42 commented 2 years ago

This should be resolved by making some changes to the deployment and sbatch script. The bottleneck is CPU usage, which gets very high when loading HDF files.

Need to make the following changes on dev, then test, and then prod.