StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Kubeflow: Intermittent Notebook Shutdowns #1532

Open Souheil-Yazji opened 1 year ago

Souheil-Yazji commented 1 year ago

This is a long-running issue for tracking instances of intermittent notebook shutdowns in user notebooks.

When reporting an occurrence, please follow the template provided below:

Environment info

Namespace:

Notebook/server:

Date-Time of shutdown:

Workload

A clear and concise description of what your workload was.

Additional context

Add any other context about the problem here.

Souheil-Yazji commented 1 year ago

Tracking https://github.com/StatCan/daaas/issues/1518 as an occurrence.

StanHatko commented 1 year ago

There was another unexpected restart of a remote desktop image this morning.

Environment info

Namespace: woody-biomass-detection

Notebook/server: geoviewer

Date-Time of shutdown: 9:50 AM EST

Workload

A geomatics remote desktop image; at the time of the restart it was running GDAL command-line programs to process imagery. When I reran the job after the restart, it completed successfully and used far less than the 4 CPU cores and 16 GiB of RAM allocated (I was not sure in advance how much CPU and RAM it would take).

Additional context

Here is a screenshot of the restart, with any potentially sensitive information redacted and a green line placed immediately before the node deletion at 9:50:19 AM.


vexingly commented 1 year ago

I can confirm that the specific instance Stan reported above is related to the node being removed due to low utilization:

I0214 14:50:19.566823 1 scale_down.go:1021] Scale-down: removing node aks-useruc-22411332-vmss000004, utilization: 
{0.2884919380335125 0.38326724176924093 0 memory 0.38326724176924093}, 
pods to reschedule: csdava-342-1/root-0,csdava-342-1/n-0,woody-biomass-detection/geoviewer-0,deil-accessibility/marcello-vh-test-0,reginald-maltais/imdb-web-0,bar-rtlbci/processing-0

This appears to be a result of a recent change where we now use larger nodes: the 'utilization' metric is diluted by the node's greater capacity relative to the size of the workloads running on it.

scale-down-utilization-threshold | Node utilization level, defined as sum of requested resources divided by capacity, below which a node can be considered for scale down
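To make the effect concrete, here is a rough Python sketch of that utilization check. The numbers are illustrative (the 12-core request total is made up), and 0.5 is the upstream cluster-autoscaler default threshold, which may be configured differently on our clusters:

# Sketch of the autoscaler's per-node utilization check.
# Illustrative figures only; 0.5 is the upstream default for
# scale-down-utilization-threshold.

def node_utilization(requested_cores: float, capacity_cores: float) -> float:
    # Utilization = sum of pod resource requests / node capacity.
    return requested_cores / capacity_cores

threshold = 0.5
requested = 12.0  # hypothetical total CPU requests of the pods on the node

for capacity in (16.0, 64.0):  # old node size vs. new node size
    util = node_utilization(requested, capacity)
    print(f"{capacity:.0f}-core node: utilization={util:.2f}, "
          f"scale-down candidate={util < threshold}")

The same set of requests that kept a 16-core node above the threshold (0.75) leaves a 64-core node well below it (0.19), which is why the node hosting geoviewer-0 was selected for scale-down.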

We will have to review these metrics to ensure this does not happen unless absolutely necessary.

StanHatko commented 1 year ago

If there are any jobs running on a node, that node should not normally be removed. With a few exceptions (like GPU nodes and servers that take up the full capacity of the node), users have no control over which node their job runs on or how many other things are running on that node, so removing a node with one or more jobs running effectively means random user workloads are being terminated. If it's necessary to "rebalance" the cluster from time to time, there should be a clearly defined restart date/time announced in advance so users know their jobs may be terminated and can plan runs accordingly.

vexingly commented 1 year ago

@StanHatko we agree. Previously this wasn't much of an issue since our nodes were ~16-core machines, but we have recently switched to 64-core machines, which are much more expensive to keep running when utilization is low.

We will need to find a balance that keeps disruption for users to a minimum while also keeping costs under control (the higher-core-count nodes are a cost-savings measure in the first place!).
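One option worth evaluating, sketched below on the assumption that the AKS autoscaler honours the standard upstream cluster-autoscaler.kubernetes.io/safe-to-evict annotation and that we would allow users (or a controller) to set it, is to mark long-running pods as not safe to evict so scale-down skips the nodes hosting them:

# Sketch only: annotate a running pod so the cluster autoscaler will not
# evict it during scale-down. Assumes the upstream
# cluster-autoscaler.kubernetes.io/safe-to-evict annotation is honoured on
# this cluster and that the caller can patch pods in the namespace.
# Pod and namespace names are taken from the report above.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() from inside a pod
v1 = client.CoreV1Api()

patch = {"metadata": {"annotations": {
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}}}

v1.patch_namespaced_pod(name="geoviewer-0",
                        namespace="woody-biomass-detection",
                        body=patch)

The trade-off is that a node hosting an annotated pod can never be scaled down, so this would need to be limited to pods with active workloads (or paired with some expiry) to avoid losing the cost savings entirely.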