GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Apache License 2.0

Error: Unsupported attribute when attempting to destroy #2837

Closed dvitale199 closed 3 weeks ago

dvitale199 commented 1 month ago

Describe the bug

I followed the tutorial video https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/v1.9.0/docs/videos/build-your-own-blueprint and made only small adjustments, not including the Rocky Linux image (see blueprint below).

After testing that the image was functioning, I attempted to run ./ghpc destroy <deployment_name_dir> and got the error shown below.

Steps to reproduce

Steps to reproduce the behavior:

  1. ./ghpc create genoslurm_blueprint.yaml
  2. submit test job: gcloud batch jobs submit batch-job-3ad191a4 --config=/path/to/batch-job.yaml --location=us-central1 --project=<project>
  3. ./ghpc destroy genoslurm-batch-us-central1

Expected behavior

deployment properly shut down and destroyed

Actual behavior

the object does not have an attribute named "job_data"

Version (ghpc --version)

ghpc version v1.36.1 Built from 'main' branch. Commit info: v1.36.1-0-g493308e7

Blueprint

blueprint_name: genoslurm-blueprint

vars:
  project_id: <project>
  deployment_name: genoslurm-batch-us-central1
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: genoslurm-network-us-central1
    source: modules/network/vpc

  - id: appfs
    source: modules/file-system/filestore
    use: [genoslurm-network-us-central1]
    settings:
      local_mount: /apps

  - id: lustrefs
    source: community/modules/file-system/DDN-EXAScaler
    use: [genoslurm-network-us-central1]
    settings: {local_mount: /scratch}

  - id: batch-job
    source: modules/scheduler/batch-job-template
    use: [genoslurm-network-us-central1, appfs, lustrefs]
    settings:
      runnable: "echo 'hello world'"
      machine_type: n2-standard-4
    outputs: [instructions]

  - id: batch-login
    source: modules/scheduler/batch-login-node
    use: [batch-job]
    outputs: [instructions]

Expanded Blueprint

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

blueprint_name: genoslurm-blueprint
ghpc_version: v1.36.1-0-g493308e7
vars:
  deployment_name: genoslurm-batch-us-central1
  labels:
    ghpc_blueprint: genoslurm-blueprint
    ghpc_deployment: ((var.deployment_name))
  project_id: <project>
  region: us-central1
  zone: us-central1-a
deployment_groups:
  - group: primary
    terraform_providers:
      google:
        source: hashicorp/google
        version: '>= 4.84.0, < 5.32.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: '>= 4.84.0, < 5.32.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: modules/network/vpc
        kind: terraform
        id: genoslurm-network-us-central1
        settings:
          deployment_name: ((var.deployment_name))
          project_id: ((var.project_id))
          region: ((var.region))
      - source: modules/file-system/filestore
        kind: terraform
        id: appfs
        use:
          - genoslurm-network-us-central1
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          local_mount: /apps
          network_id: ((module.genoslurm-network-us-central1.network_id))
          project_id: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      - source: community/modules/file-system/DDN-EXAScaler
        kind: terraform
        id: lustrefs
        use:
          - genoslurm-network-us-central1
        settings:
          labels: ((var.labels))
          local_mount: /scratch
          network_self_link: ((module.genoslurm-network-us-central1.network_self_link))
          project_id: ((var.project_id))
          subnetwork_address: ((module.genoslurm-network-us-central1.subnetwork_address))
          subnetwork_self_link: ((module.genoslurm-network-us-central1.subnetwork_self_link))
          zone: ((var.zone))
      - source: modules/scheduler/batch-job-template
        kind: terraform
        id: batch-job
        use:
          - genoslurm-network-us-central1
          - appfs
          - lustrefs
        outputs:
          - name: instructions
        settings:
          deployment_name: ((var.deployment_name))
          job_id: batch-job
          labels: ((var.labels))
          machine_type: n2-standard-4
          network_storage: ((flatten([module.lustrefs.network_storage, flatten([module.appfs.network_storage])])))
          project_id: ((var.project_id))
          region: ((var.region))
          runnable: echo 'hello world'
          subnetwork: ((module.genoslurm-network-us-central1.subnetwork))
      - source: modules/scheduler/batch-login-node
        kind: terraform
        id: batch-login
        use:
          - batch-job
        outputs:
          - name: instructions
        settings:
          deployment_name: ((var.deployment_name))
          gcloud_version: ((module.batch-job.gcloud_version))
          instance_template: ((module.batch-job.instance_template))
          job_data: ((flatten([module.batch-job.job_data])))
          labels: ((var.labels))
          network_storage: ((flatten([module.batch-job.network_storage])))
          project_id: ((var.project_id))
          region: ((var.region))
          startup_script: ((module.batch-job.startup_script))
          zone: ((var.zone))

Output and logs

Testing if deployment group genoslurm-batch-us-central1/primary requires destroying cloud infrastructure
failed to destroy group "primary":
Error: exit status 1

Error: Unsupported attribute

  on main.tf line 64, in module "batch-login":
  64:   job_data          = flatten([module.batch-job.job_data])
    ├────────────────
    │ module.batch-job is object with 3 attributes

This object does not have an attribute named "job_data".

Hint: terraform plan for deployment group genoslurm-batch-us-central1/primary failed
destruction of "genoslurm-batch-us-central1" failed

Execution environment

Additional context

Apologies if this is a simple misunderstanding rather than a bug!

cdunbar13 commented 1 month ago

I'll take a look and try to reproduce the issue.

cdunbar13 commented 1 month ago

Did you deploy the blueprint after you created it?

./ghpc deploy genoslurm-batch-us-central1 --auto-approve

dvitale199 commented 1 month ago

Yes, apologies, I forgot to include that in the steps above. It is currently deployed and I have successfully run jobs.

cdunbar13 commented 1 month ago

I haven't been able to reproduce the issue, but I will continue trying. In the meantime, could you try removing the resources in your project by hand, recreating the deployment folder (perhaps with a different deployment name) and trying to deploy and destroy again?

Another place to look is in genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-job-template/outputs.tf and see if the job_data output exists there. That's what the issue you posted is complaining about.

dvitale199 commented 1 month ago

done. job_data exists in outputs.tf as it did in the previous test:

output "job_data" {
  description = "All data associated with the defined job, typically provided as input to cloud-batch-login-node."
  value = {
    template_contents = local.job_template_contents,
    filename          = local.job_filename,
    id                = local.submit_job_id
  }
}

does this need to be a list?

cdunbar13 commented 1 month ago

Thanks, one more quick question before I try and dive deeper. Could you try cloning a clean copy of the repository, building ghpc, and running the same steps with the original blueprint? The only thing you should change is the project name.

nick-stroud commented 1 month ago

I took a look at the code and nothing obvious stood out. This message seems a bit odd to me: `module.batch-job is object with 3 attributes`, because `batch-job` is the sub-module and should have more than 3 attributes. Really it is `module.batch-job.job_data` that should have 3 attributes. But maybe this is just Terraform having a bit of a weird error.
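For context on the `flatten` call in question: Terraform's `flatten` only unwraps nested lists, so a non-list element like the `job_data` object is kept as-is, and `flatten([module.batch-job.job_data])` simply wraps the object in a one-element list. A quick sketch in `terraform console` (the object contents here are made up for illustration):

```
> flatten([{ id = "batch-job-3ad191a4", filename = "job.yaml" }])
[
  {
    "filename" = "job.yaml"
    "id" = "batch-job-3ad191a4"
  },
]
```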

@dvitale199, could you update your blueprint to print the `job_data` output? To do this, add `job_data` to the outputs list of the batch-job module (as shown below). Then when you call `ghpc deploy` it should print an output at the end containing the contents of `job_data`. There might be something interesting in there, like a null value that should be populated.

  - id: batch-job
    source: modules/scheduler/batch-job-template
    use: [genoslurm-network-us-central1, appfs, lustrefs]
    settings:
      runnable: "echo 'hello world'"
      machine_type: n2-standard-4
    outputs: [instructions, job_data]

The other interesting thing is the error message says `Hint: terraform plan for deployment group genoslurm-batch-us-central1/primary failed`. I am not sure if it is possible to print out the plan; it may not be, since it says it is failing to generate a plan. To try, call `ghpc destroy` (without `--auto-approve`) and, when prompted, select the display option (d). There might be a clue in there.

dvitale199 commented 1 month ago

Ok, I've done this, but when I use ghpc destroy it does not prompt me to display; it just fails with the same error. I've attached the display from the deploy command below: genoslurm-batch-test-create-display.txt

I've also tried this using the gcluster command instead of ghpc. Unsure if there is a difference, but I got the same result. Going to try one of the example configs and see if there's any difference.

Is there a potential path issue here? I store my .yaml in cluster-toolkit/ and run everything with ./ghpc from the cluster-toolkit directory.

I'm very sorry for the trouble. I believe everything is working alright besides the destroy, which is only a minor inconvenience for testing. If I figure anything out I will comment back. I appreciate the help.

cdunbar13 commented 1 month ago

No worries about the trouble. I'll keep looking into this.

Which of the suggestions from @nick-stroud's and my responses did you try?

Another quick question is: which version of Terraform are you using?

dvitale199 commented 1 month ago

I tried adding job_data to the outputs of batch-job, for which I attached the output above. I also tried running ghpc destroy without --auto-approve, but it fails before the option to display is given.

I have Terraform v1.3.7 on darwin_arm64

rohitramu commented 3 weeks ago

@dvitale199 A few questions that might lead to some clues:

  1. When recreating the deployment folder in your second test, did you use the -w flag, or did you delete the deployment folder before calling ghpc create?
  2. When you re-created the deployment folder, which version of ghpc did you use?
  3. Do you see the job_data variable in this file in your deployment folder? genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-login-node/variables.tf
dvitale199 commented 3 weeks ago

@dvitale199 A few questions that might lead to some clues:

  1. When recreating the deployment folder in your second test, did you use the -w flag, or did you delete the deployment folder before calling ghpc create?
  2. When you re-created the deployment folder, which version of ghpc did you use?
  3. Do you see the job_data variable in this file in your deployment folder? genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-login-node/variables.tf
  1. I deleted the previous directory and recreated it with ghpc create.
  2. The latest version from GitHub.
  3. Yes, job_data was there. I cannot share it because I deleted it by mistake during further testing.

Since my last comment, I've pulled a fresh clone and tested with examples/hpc-slurm.yaml, and I have not run into any issues creating and destroying that config.

I'm wondering if passing use: [genoslurm-network-us-central1, appfs, lustrefs] to the batch nodeset had anything to do with it?

are use: [genoslurm-network-us-central1, appfs, lustrefs] and

use:
- genoslurm-network-us-central1 
- appfs 
- lustrefs

synonymous?

thanks for all your help.

rohitramu commented 3 weeks ago

That's great to hear you were able to get the expected behavior from a fresh clone!

Regarding the use block: yes, the syntax in both of your examples is synonymous. They are just two different ways (flow style and block style) to represent a list in YAML.
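As a generic illustration (not specific to this blueprint), both styles load to the identical list; the choice is purely cosmetic:

```yaml
# Flow (inline) style:
use: [genoslurm-network-us-central1, appfs, lustrefs]

# Block style -- parses to exactly the same three-element list:
use:
  - genoslurm-network-us-central1
  - appfs
  - lustrefs
```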

I will close this issue since it appears to be resolved, but please feel free to reopen or create a new issue if you encounter other problems!