aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

Cluster creation fails due to invalid json in provisioning_parameters.json #234

Open sean-smith opened 3 months ago

sean-smith commented 3 months ago

The provisioning_parameters.json file needs to be valid JSON, or cluster creation will fail. For example, the following JSON is missing the partition_name value:

{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name":
    }
  ],
  "fsx_dns_name": "fs-05dac34e835f2c...fsx.us-west-2.amazonaws.com",
  "fsx_mountname": "4owup..."
}

This passes the validator but fails cluster creation.

This is addressed in https://github.com/aws-samples/awsome-distributed-training/pull/233
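
For illustration, here is a minimal sketch of the kind of pre-flight check that would catch this before cluster creation. It is not necessarily what the PR implements: `check_json_text` is a hypothetical helper, and the snippet is a trimmed copy of the example above. It only relies on Python's standard `json` module, which rejects a key with no value:

```python
import json


def check_json_text(text):
    """Return None if text parses as JSON, otherwise a short error message.

    Hypothetical helper for illustration; not part of the repository.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the spot where the parser gave up
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Trimmed version of the broken example: "partition_name" has no value
broken = """{
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name":
    }
  ]
}"""

problem = check_json_text(broken)
if problem is not None:
    print(problem)
```

Running the same check on the real file (for example with `open("provisioning_parameters.json").read()`) before creating the cluster would surface the syntax error early instead of at provisioning time.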

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 30 days with no activity.