Jcs/update hsds confg - Githubissues

jcschaff commented 1 year ago

The intent is to improve the stability of the HSDS service (and cluster as a whole).

Explicitly supplying container resource specifications is considered best practice. It helps avoid oversubscribed nodes, which could starve services during heavy use (for example, when the HSDS services are under load).

1) Minor change to the HSDS configuration (using the new data-node K8s selector): k8s_dn_label_selector: app=hsds instead of k8s_dn_label_selector: app=hsds (which is deprecated). Both parameters are included for compatibility.

2) K8s container resource specs are now supplied for HSDS and all biosim services. All limits below are newly introduced unless otherwise noted.

hsds-sn (HSDS service node)
- resources.requests.memory: 1Gi
- resources.requests.cpu: 500m
- resources.limits.memory: 1Gi
- resources.limits.cpu: 1000m
hsds-dn (HSDS data node)
- resources.requests.memory: 2Gi
- resources.requests.cpu: 500m
- resources.limits.memory: 2Gi
- resources.limits.cpu: 1000m
combine-api
- resources.requests.memory: 1Gi (unchanged)
- resources.requests.cpu: 500m (was 25m)
- resources.limits.memory: 2Gi (unchanged)
- resources.limits.cpu: 1000m (unchanged)
account-api
- resources.requests.memory: 500Mi
- resources.requests.cpu: 200m
- resources.limits.memory: 1Gi
- resources.limits.cpu: 500m
simulators-api
- resources.requests.memory: 1Gi
- resources.requests.cpu: 500m
- resources.limits.memory: 2Gi
- resources.limits.cpu: 1000m
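
For orientation, here is a minimal sketch of how the hsds-dn values listed above might appear in a container spec. Only the four resource values come from this PR. The container name placement and the image reference are placeholders assumed for illustration, and the actual manifests in the repository may be laid out differently.

```yaml
# Sketch only: surrounding structure and image are assumptions, not taken from the PR.
containers:
  - name: hsds-dn
    image: example/hsds:latest   # hypothetical image reference
    resources:
      requests:
        memory: 2Gi    # amount the scheduler reserves on the node
        cpu: 500m
      limits:
        memory: 2Gi    # exceeding this gets the container OOM-killed
        cpu: 1000m     # CPU above this is throttled, not killed
```

Requests are what the scheduler uses to place the pod, which is how explicit values help avoid oversubscribed nodes. Limits are enforced at runtime.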

bilalshaikh42 commented 1 year ago

The only concern with limits is that Kubernetes kills the pod if it exceeds its memory limit, even when resources are still available on the node. This was the reason I did not include limits in the first place. It might make sense to set the requests and the limits to the same value to try to get a Guaranteed QoS class for the pods instead, but this comes with the same fundamental limitation (pods being killed for exceeding limits), which might just increase the instability.

These configurations were very much guess-and-check from before I fully understood how this worked, so there is definitely a need for more tweaking and experimentation.
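
As a rough illustration of the Guaranteed QoS idea mentioned above: Kubernetes assigns the Guaranteed class only when every container in the pod has both CPU and memory requests equal to their limits. The values below are illustrative assumptions, not a proposal from this thread.

```yaml
# Sketch only: values are illustrative, chosen so requests == limits.
resources:
  requests:
    memory: 2Gi
    cpu: 1000m
  limits:
    memory: 2Gi    # same as the request
    cpu: 1000m     # same as the request
```

By contrast, a spec with a 500m CPU request and a 1000m CPU limit would be classed as Burstable even if its memory request and limit match. In either class, a container that exceeds its memory limit is still killed.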

jcschaff commented 1 year ago

@bilalshaikh42 thanks for your perspective. I was aware of pods getting killed if they exceed their limits, but was not aware of the Guaranteed QoS class. I will look into it and investigate further before I merge.

biosimulations / deployment

Jcs/update hsds confg #73
