Closed jonrkarr closed 2 years ago
The service seems to be returning 503s, meaning it is overloaded. Restarting it at that time could result in data loss since the cache is dirty. It might be that removing the results cache is what is causing the extra load. We can try to increase the number of replicas.
When running multiple simulations, the service gets stuck and then recovers after some time. This is probably due to the max pending writes being exceeded. I'll look into what could be causing this. Maybe we need to move the bucket to the same region.
Debugging in https://github.com/HDFGroup/hsds/issues/130
This should be resolved by making some changes to the deployment and sbatch script. The bottleneck is CPU usage, which gets very high when loading HDF5 files.
Need to make the following changes on dev, then test, and then apply them to prod.
comment on biosimulations/deployment#69 regarding node scaling when there are pending write tasks: In most cases, the on_shutdown function in datanode.py should prevent any new write requests from being accepted and flush existing writes. Kubernetes is supposed to give the pod at least 2 seconds to clean up, which I think should be sufficient in most cases.
But you need to be sure that your deployment YAML tells Kubernetes what to do when a pod is terminated. At some point the HSDS Docker build stopped including curl in the image, so the example YAMLs weren't doing the right thing. I changed the examples to use /usr/sbin/killall5 rather than /usr/sbin/curl (see: 0248bdd) and can see the on_shutdown function being called at termination.
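As a rough sketch, the termination handling described above could look like this in the deployment YAML. The container name, image tag, and grace period here are illustrative assumptions, not the project's actual manifest; only the killall5 path comes from the commit referenced above:

```yaml
# Sketch of the relevant part of an HSDS data-node pod spec.
# Names and the grace period are hypothetical.
spec:
  terminationGracePeriodSeconds: 30   # give on_shutdown time to flush pending writes
  containers:
    - name: dn                        # assumed container name
      image: hdfgroup/hsds            # assumed image
      lifecycle:
        preStop:
          exec:
            # killall5 delivers SIGTERM so datanode.py's on_shutdown runs;
            # earlier example YAMLs invoked curl, which is no longer in the image
            command: ["/usr/sbin/killall5"]
```

The preStop hook runs before Kubernetes sends its own SIGTERM, so the process gets the full grace period to reject new writes and flush the dirty cache.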
In the past few runs of ModelDB and BiGG, HSDS appears to have locked up: Kubernetes thinks the HSDS pods are healthy, but HSDS fails to process GET or POST requests.
As suggested in biosimulations/biosimulations#4340, a more relevant way of evaluating the health of HSDS (and restarting it) might be helpful. Perhaps this check should try uploading and downloading a test HDF5 file; if that fails, the service should be restarted.
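A minimal sketch of what such a health check could look like, assuming a generic HTTP endpoint for the service (the function name and endpoint are hypothetical). This version only verifies that the service answers requests; the stricter check suggested above would additionally upload and download a small test HDF5 file (e.g. via h5pyd) and compare contents:

```python
# Hypothetical liveness check for an HSDS endpoint. If it returns False,
# the orchestrator (e.g. a Kubernetes exec probe) would restart the pod.
import urllib.error
import urllib.request


def check_hsds_health(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the service at `endpoint` answers a GET with HTTP 200.

    This catches the lock-up described above, where pods look healthy to
    Kubernetes but stop serving GET/POST requests.
    """
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error -> unhealthy
        return False
```

Wired into the deployment as an exec liveness probe, this would restart the pod when the round trip fails, rather than relying on the default process-alive check.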
This seems to be the most critical issue with the dispatch service. There appear to be more issues, but we'll have to leave those for the future to free up bandwidth for Physiome, BioModels, and UI improvements.