awslabs / autonomous-driving-data-framework

ADDF is a collection of modules, deployed using the SeedFarmer orchestration tool. ADDF modules enable users to quickly bootstrap environments for the processing and analysis of autonomous driving data.
Apache License 2.0
113 stars 44 forks

[BUG] ML on EKS with GPU does not autoscale #414

Closed dgraeber closed 8 months ago

dgraeber commented 8 months ago

When deploying the solution at autonomous-driving-data-framework/manifests/ml-training-on-eks and running the Step Function that it creates, the GPU nodes do not scale up and the processing stays in a stopped state. Indications are that this is due to a taint on the compute nodes, or to an issue with the fsx.csi driver and the Lustre-on-EKS integration modules.

ref @kevinsoucy @a13zen for details

To replicate:

  1. deploy the solution
  2. kick off the step functions
  3. watch and wait

dgraeber commented 8 months ago

The current release is release/3.0.0. This issue is on main and has not been released.

dgraeber commented 8 months ago

NOTES for posterity: As per @kevinsoucy and @a13zen, the GPU compute node is recognized and was used for training. Autoscaling was not working properly, but scaling manually does work.

dgraeber commented 8 months ago

This is dependent on:

dgraeber commented 8 months ago

Root Cause determined:

  1. An ASG taint prevented new nodes from becoming available. Lines 63-67 of /autonomous-driving-data-framework/modules/ml-training/k8s-managed/configure_asgs.py should be removed.

  2. The older fsx-driver was not removing the taint needed to prevent a race condition. The existing fsx_driver:version: 1.5.1 installed the aws-fsx-csi-driver:v0.9.0 image, which may not have the latest code. Using fsx_driver:version: 1.8.0 installed aws-fsx-csi-driver:v1.1.0, which removed the taint as added by this PR: https://github.com/awslabs/idf-modules/pull/140
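To see whether the FSx taint is still sitting on a node after the driver starts, a minimal sketch could scan the output of `kubectl get nodes -o json`. The function name and the sample node data are hypothetical. The only value taken from this thread is the taint key `fsx.csi.aws.com/agent-not-ready`.

```python
from __future__ import annotations

# Taint key quoted in this thread; everything else here is illustrative.
FSX_TAINT_KEY = "fsx.csi.aws.com/agent-not-ready"


def nodes_with_taint(node_list: dict, taint_key: str = FSX_TAINT_KEY) -> list[str]:
    """Return the names of nodes that still carry the given taint key.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    tainted = []
    for node in node_list.get("items", []):
        # `spec.taints` is absent (or null) when a node has no taints.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == taint_key for t in taints):
            tainted.append(node["metadata"]["name"])
    return tainted


# Hypothetical sample: one node still tainted, one clean.
sample = {
    "items": [
        {
            "metadata": {"name": "gpu-node-a"},
            "spec": {"taints": [{"key": FSX_TAINT_KEY, "effect": "NoExecute"}]},
        },
        {"metadata": {"name": "gpu-node-b"}, "spec": {}},
    ]
}

stuck = nodes_with_taint(sample)
print(stuck)
```

With real data, the dictionary would come from something like `json.loads(subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"]))`. A node that stays in the result long after joining the cluster suggests the driver is not removing the taint.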

Needed to remediate:

  1. use PR https://github.com/awslabs/idf-modules/pull/140 and add the taint as described:
       eks_node_taints:
         - key: fsx.csi.aws.com/agent-not-ready
           effect: NO_EXECUTE
  2. remove the ASG taint in ml-training/k8s-managed/configure_asgs.py (lines 63-67, k8s.io/cluster-autoscaler/node-template/taint/dedicated): https://github.com/awslabs/autonomous-driving-data-framework/issues/418
  3. update the supported fsx driver version in the dataFile (in idf-modules: https://github.com/awslabs/idf-modules/issues/141)
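For an ASG that was already deployed with the tag, removing the lines from configure_asgs.py may not be enough on its own (an assumption, not something confirmed in this thread). A minimal sketch of clearing the tag by hand with boto3 could look like the following. The function names are hypothetical, the tag key is the one quoted in step 2, and pagination of `describe_tags` is ignored for brevity.

```python
# Tag key quoted in step 2 of the remediation list.
TAINT_TAG_KEY = "k8s.io/cluster-autoscaler/node-template/taint/dedicated"


def taint_tags_to_delete(tags, tag_key=TAINT_TAG_KEY):
    """Build `delete_tags` entries for every ASG tag whose key matches."""
    return [
        {
            "ResourceId": tag["ResourceId"],
            "ResourceType": "auto-scaling-group",
            "Key": tag["Key"],
        }
        for tag in tags
        if tag.get("Key") == tag_key
    ]


def remove_taint_tag(asg_name):
    """Delete the node-template taint tag from one ASG (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without boto3

    client = boto3.client("autoscaling")
    resp = client.describe_tags(
        Filters=[{"Name": "auto-scaling-group", "Values": [asg_name]}]
    )
    to_delete = taint_tags_to_delete(resp["Tags"])
    if to_delete:
        client.delete_tags(Tags=to_delete)
    return to_delete


# Hypothetical tag listing, shaped like the `Tags` field of describe_tags.
example_tags = [
    {"ResourceId": "gpu-asg", "ResourceType": "auto-scaling-group",
     "Key": TAINT_TAG_KEY, "Value": "gpu:NoSchedule", "PropagateAtLaunch": False},
    {"ResourceId": "gpu-asg", "ResourceType": "auto-scaling-group",
     "Key": "Name", "Value": "gpu-asg", "PropagateAtLaunch": True},
]
pending = taint_tags_to_delete(example_tags)
print(pending)
```

Keeping the tag selection in a separate pure function makes it easy to review what would be deleted before calling `remove_taint_tag` against a real account.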