[BUG] ML on EKS with GPU does not autoscale

dgraeber commented 8 months ago

When deploying the solution at autonomous-driving-data-framework/manifests/ml-training-on-eks and running the Step Function that it creates, the GPU nodes do not scale up and the processing stays in a stopped state. Early indications point to either a taint on the compute nodes or an issue with the fsx.csi driver and the lustre-on-eks integration modules.

ref @kevinsoucy @a13zen for details

To replicate:

1. Deploy the solution.
2. Kick off the Step Function.
3. Watch and wait.

dgraeber commented 8 months ago

The current release is release/3.0.0. This issue is on main and has not been released.

dgraeber commented 8 months ago

NOTES for posterity: As per @kevinsoucy and @a13zen, the GPU compute node is recognized and was used for training. Autoscaling was not working properly, but manually scaling the node group does work.

dgraeber commented 8 months ago

This is dependent on:

dgraeber commented 8 months ago

Root Cause determined:

1. An ASG taint prevented new nodes from becoming available. Lines 63-67 of /autonomous-driving-data-framework/modules/ml-training/k8s-managed/configure_asgs.py should be removed.
2. The older fsx driver was not removing the taint that is needed to prevent a race condition. The existing fsx_driver:version: 1.5.1 installed the aws-fsx-csi-driver:v0.9.0 image, which may not have the latest code. Using fsx_driver:version: 1.8.0 installed aws-fsx-csi-driver:v1.1.0, which removed the taint as added by this PR: https://github.com/awslabs/idf-modules/pull/140
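
To see whether the second cause applies on a given cluster, it can help to list the nodes that still carry the startup taint. This is a minimal sketch, not part of the fix itself. It assumes node data in the JSON shape produced by `kubectl get nodes -o json`; the helper name `nodes_with_taint` and the sample node names are hypothetical.

```python
# Taint key taken from the thread; everything else here is illustrative.
FSX_TAINT_KEY = "fsx.csi.aws.com/agent-not-ready"


def nodes_with_taint(node_list, taint_key):
    """Return the names of nodes whose spec.taints contains taint_key."""
    stuck = []
    for node in node_list.get("items", []):
        # spec.taints is absent (or null) on nodes with no taints.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == taint_key for t in taints):
            stuck.append(node["metadata"]["name"])
    return stuck


# Hypothetical input, trimmed to the fields the helper reads.
# Kubernetes reports the taint effect as "NoExecute".
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "gpu-node-a"},
            "spec": {"taints": [{"key": FSX_TAINT_KEY, "effect": "NoExecute"}]},
        },
        {"metadata": {"name": "cpu-node-b"}, "spec": {}},
    ]
}

print(nodes_with_taint(sample_nodes, FSX_TAINT_KEY))
```

If a node stays in this list long after joining the cluster, the installed driver is probably not clearing the taint, which matches the behaviour described for the older image.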
Needed to remediate:
1. Use PR https://github.com/awslabs/idf-modules/pull/140 and add the taint as described:
```
   eks_node_taints:
    - key: fsx.csi.aws.com/agent-not-ready
      effect: NO_EXECUTE
```
2. Remove the ASG taint in ml-training/k8s-managed/configure_asgs.py (lines 63-67, k8s.io/cluster-autoscaler/node-template/taint/dedicated); see https://github.com/awslabs/autonomous-driving-data-framework/issues/418
3. Update the fsx driver supported version in the dataFile (in idf-modules, see https://github.com/awslabs/idf-modules/issues/141)
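
For clusters that were deployed before the configure_asgs.py change, the tag may still be present on existing Auto Scaling groups. The sketch below shows one possible way to clear it with boto3. It is an assumption about how cleanup could be done, not what the module does. The helper names, the sample group name, and the sample tag value are hypothetical; only the tag key comes from the thread.

```python
# Tag key taken from the thread.
DEDICATED_TAINT_TAG = "k8s.io/cluster-autoscaler/node-template/taint/dedicated"


def build_delete_tags(describe_response, tag_key=DEDICATED_TAINT_TAG):
    """Build the Tags payload for autoscaling delete_tags from a
    describe_auto_scaling_groups response, for groups that carry tag_key."""
    payload = []
    for asg in describe_response.get("AutoScalingGroups", []):
        for tag in asg.get("Tags", []):
            if tag.get("Key") == tag_key:
                payload.append(
                    {
                        "ResourceId": asg["AutoScalingGroupName"],
                        "ResourceType": "auto-scaling-group",
                        "Key": tag_key,
                    }
                )
    return payload


def remove_dedicated_taint_tag(asg_names):
    """Delete the tag from the named groups. Requires AWS credentials."""
    import boto3  # imported lazily so build_delete_tags runs without AWS access

    client = boto3.client("autoscaling")
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=asg_names)
    tags = build_delete_tags(response)
    if tags:
        client.delete_tags(Tags=tags)
    return tags


# Hypothetical response, trimmed to the fields the helper reads.
sample_response = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "gpu-asg",
            "Tags": [{"Key": DEDICATED_TAINT_TAG, "Value": "example:NoSchedule"}],
        },
        {"AutoScalingGroupName": "cpu-asg", "Tags": []},
    ]
}

print(build_delete_tags(sample_response))
```

Keeping the payload construction separate from the AWS calls makes it possible to review exactly which groups would be touched before anything is deleted.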

[BUG] ML on EKS with GPU does not autoscale #414