Closed by dgraeber 8 months ago
The current release is release/3.0.0. This issue is on main and has not been released.
NOTES for posterity: As per @kevinsoucy and @a13zen, the GPU compute node is recognized and was used for training. Autoscaling was not working properly, but the job does run if the nodes are scaled manually.
Root Cause determined:
An ASG taint prevented new nodes from becoming available. Lines 63-67 of /autonomous-driving-data-framework/modules/ml-training/k8s-managed/configure_asgs.py should be removed.
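To illustrate why a taint can keep pods off a node, here is a minimal sketch of Kubernetes taint/toleration matching in plain Python. It is not code from configure_asgs.py or from the scheduler; the class and function names (Taint, Toleration, tolerates, can_schedule) and the "dedicated=gpu" taint in the usage note are hypothetical, chosen only for illustration. It assumes the standard Kubernetes rules: a toleration matches a taint when key and effect agree (an empty toleration effect matches any effect), and the operator is either Exists or Equal with equal values.

```python
from dataclasses import dataclass
from typing import Iterable

# Effects that stop a pod from being placed on a node.
# "PreferNoSchedule" is only a soft preference, so it is not listed here.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    """Hypothetical stand-in for a node taint."""
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    """Hypothetical stand-in for a pod toleration."""
    key: str = ""            # empty key (with Exists) matches every taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    # Operator "Equal": the values must match as well.
    return toleration.value == taint.value


def can_schedule(node_taints: Iterable[Taint],
                 pod_tolerations: Iterable[Toleration]) -> bool:
    """A pod fits only if every blocking taint has a matching toleration."""
    tolerations = list(pod_tolerations)
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
        if taint.effect in BLOCKING_EFFECTS
    )
```

For example, with a node tainted Taint("dedicated", "NoSchedule", "gpu"), can_schedule returns False for a pod with no tolerations and True once the pod carries Toleration(key="dedicated", value="gpu", effect="NoSchedule"). A taint advertised on the node group that the training pods do not tolerate would produce the first case, which would be consistent with the pending scale-up described in this issue.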
Needed to remediate:
eks_node_taints:
  - key: fsx.csi.aws.com/agent-not-ready
    effect: NO_EXECUTE
k8s.io/cluster-autoscaler/node-template/taint/dedicated
https://github.com/awslabs/autonomous-driving-data-framework/issues/418
When deploying the solution at:
autonomous-driving-data-framework/manifests/ml-training-on-eks
and running the Step Function that is created, the GPU nodes do not scale up and processing remains in a stopped state. Indications point to either a taint on the compute nodes or an issue with the fsx.csi driver and the lustre-on-eks integration modules. Ref @kevinsoucy @a13zen for details.
To replicate: