Closed: dvitale199 closed this issue 3 weeks ago.
I'll take a look and try to reproduce the issue.
Did you deploy the blueprint after you created it?
./ghpc deploy genoslurm-batch-us-central1 --auto-approve
yes, apologies, I forgot to include that in the steps above. it is currently deployed and I have successfully run jobs
I haven't been able to reproduce the issue, but I will continue trying. In the meantime, could you try removing the resources in your project by hand, recreating the deployment folder (perhaps with a different deployment name) and trying to deploy and destroy again?
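Something along these lines is what I have in mind (a rough sketch; the blueprint filename and the new deployment name below are placeholders, not taken from your setup):

```sh
# Rough sketch of the retest cycle; filenames/names below are placeholders.
# 1. After removing the leftover resources by hand (console or gcloud),
#    recreate the deployment folder under a new deployment name.
./ghpc create genoslurm-batch.yaml --vars deployment_name=genoslurm-batch-test
# 2. Deploy and then destroy again to see if the job_data error reappears.
./ghpc deploy genoslurm-batch-test --auto-approve
./ghpc destroy genoslurm-batch-test --auto-approve
```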
Another place to look is in `genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-job-template/outputs.tf` to see if the `job_data` output exists there. That's what the issue you posted is complaining about.
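A quick way to check (a sketch, assuming you run it from the directory that contains the deployment folder):

```sh
# Look for the job_data output definition in the embedded copy of the module.
grep -n 'output "job_data"' \
  genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-job-template/outputs.tf
```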
Done. `job_data` exists in `outputs.tf` as it did in the previous test:
output "job_data" {
description = "All data associated with the defined job, typically provided as input to clout-batch-login-node."
value = {
template_contents = local.job_template_contents,
filename = local.job_filename,
id = local.submit_job_id
}
}
does this need to be a list?
Thanks, one more quick question before I try and dive deeper. Could you try cloning a clean copy of the repository, building ghpc, and running the same steps with the original blueprint? The only thing you should change is the project name.
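For reference, a rough sketch of that clean-clone test (the repository URL and `make` build step are the standard ones; the blueprint path, deployment folder name, and project id below are placeholders):

```sh
# Clean clone and rebuild of ghpc, then the same create/deploy/destroy steps.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make
./ghpc create /path/to/original-blueprint.yaml --vars project_id=<your-project>
./ghpc deploy <deployment_name_dir> --auto-approve
./ghpc destroy <deployment_name_dir>
```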
I took a look at the code and nothing obvious stood out. This message seems a bit odd to me: `module.batch-job is object with 3 attributes`, because `batch-job` is the sub-module and should have more than 3 attributes. Really it is `module.batch-job.job_data` that should have 3 attributes. But maybe this is just Terraform producing a bit of a weird error.
@dvitale199, could you update your blueprint to print the `job_data` output? To do this you would add `job_data` to the outputs list of the batch-job (as shown below). Then when you call `ghpc deploy` it should print an output at the end containing the contents of `job_data`. There might be something interesting in there, like a null value that should be populated.
```yaml
- id: batch-job
  source: modules/scheduler/batch-job-template
  use: [genoslurm-network-us-central1, appfs, lustrefs]
  settings:
    runnable: "echo 'hello world'"
    machine_type: n2-standard-4
  outputs: [instructions, job_data]
```
The other interesting thing is the error message says `Hint: terraform plan for deployment group genoslurm-batch-us-central1/primary failed`. I am not sure if it is possible to print out the plan; it might not be, since it says it is failing to generate a plan. To try, you would call `ghpc destroy` (no `--auto-approve`) and then, when prompted, select the display option (`d`). There might be a clue in there.
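Concretely, something like this (a sketch; the exact prompt wording may differ between ghpc versions):

```sh
# Run destroy without --auto-approve so ghpc stops at the approval prompt.
./ghpc destroy genoslurm-batch-us-central1
# At the prompt, enter "d" to display the proposed changes (the Terraform plan).
```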
Ok, I've done this, but when I use `ghpc destroy` it does not prompt me to display; it just fails with the same error. I've attached the display output from the deploy command below: genoslurm-batch-test-create-display.txt
I've also tried this using the `gcluster` command instead of `ghpc`; unsure if there is a difference, but I got the same result. I'm going to try one of the example configs and see if there's any difference.
Is there a potential path issue here? I store my .yaml in cluster-toolkit/ and run everything with ./ghpc from cluster-toolkit/.
I'm very sorry for the trouble. I believe everything is working alright besides the destroy, which is only a minor inconvenience for testing. If I figure anything out I will comment back. I appreciate the help.
No worries about the trouble. I'll keep looking into this.
Which of the things from @nick-stroud's and my responses did you try?
Another quick question is: which version of Terraform are you using?
I tried adding `job_data` to the outputs of batch-job, for which I attached the output above. I also tried running `ghpc destroy` with no `--auto-approve`, but it fails before the option to display is given.
I have Terraform v1.3.7 on darwin_arm64
@dvitale199 A few questions that might lead to some clues:
- When recreating the deployment folder in your second test, did you use the `-w` flag, or did you delete the deployment folder before calling `ghpc create`?
- When you re-created the deployment folder, which version of ghpc did you use?
- Do you see the `job_data` variable in this file in your deployment folder? `genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-login-node/variables.tf` (a quick grep sketch is shown below)
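For the last question, a quick check could look like this (a sketch, run from the directory that contains the deployment folder):

```sh
# Check whether the embedded batch-login-node module declares the job_data variable.
grep -n 'variable "job_data"' \
  genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-login-node/variables.tf
```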
ghpc create
Since my last comment, I've pulled a fresh clone and just tested with the examples/hpc-slurm.yaml blueprint, and have not run into any issues creating and destroying the config.
I'm wondering if passing `use: [genoslurm-network-us-central1, appfs, lustrefs]` to the batch nodeset had anything to do with it? Are

```yaml
use: [genoslurm-network-us-central1, appfs, lustrefs]
```

and

```yaml
use:
  - genoslurm-network-us-central1
  - appfs
  - lustrefs
```

synonymous?
thanks for all your help.
That's great to hear you were able to get the expected behavior from a fresh clone!
Regarding the `use` block: yes, the syntax in both of your examples is synonymous. They are just two different ways to represent a list in YAML.
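If you want to convince yourself, a one-liner like this shows both forms load to the same value (a sketch; assumes python3 with PyYAML installed):

```sh
# Flow-style and block-style YAML lists parse to the identical value.
python3 -c 'import yaml; print(yaml.safe_load("use: [a, b, c]") == yaml.safe_load("use:\n- a\n- b\n- c"))'
# prints: True
```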
I will close this issue since it appears to be resolved, but please feel free to reopen or create a new issue if you encounter other problems!
Describe the bug
I followed the tutorial video https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/v1.9.0/docs/videos/build-your-own-blueprint and made only small adjustments, not including the rocky linux image (check blueprint below).
After testing that the image was functioning, I attempted to use `./ghpc destroy <deployment_name_dir>` and got the error shown under "Actual behavior" below.
Steps to reproduce
Steps to reproduce the behavior:
gcloud batch jobs submit batch-job-3ad191a4 --config=/path/to/batch-job.yaml --location=us-central1 --project=<project>
Expected behavior
deployment properly shut down and destroyed
Actual behavior
`the object does not have an attribute named "job_data"`
Version (`ghpc --version`)
ghpc version v1.36.1 Built from 'main' branch. Commit info: v1.36.1-0-g493308e7
Blueprint
Expanded Blueprint
Output and logs
Execution environment
Shell (`ps -p $$`): /bin/sh
Additional context
Apologies if this is a simple misunderstanding rather than a bug!