aws-samples / 1click-hpc

Deploy your HPC Cluster on AWS in 20min. with just 1-Click.
MIT No Attribution

DCV Jobs failing #2

Open sean-smith opened 3 years ago

sean-smith commented 3 years ago

Not sure if there's a setup step that I'm missing here, but when I run the included Windows or Linux DCV job I get:

sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)
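
For context, `-C dcv2` is a Slurm feature constraint, so this submission can only succeed if some compute node advertises a `dcv2` feature. A minimal diagnostic sketch (the feature and partition names in a given cluster are assumptions):

```bash
# List partitions, their nodes, and the features each node advertises;
# the service submits with "-C dcv2", so a matching feature must exist.
sinfo -o "%P %N %f"

# Reproduce the submission by hand to see Slurm's own error message
# (-J and -C are copied from the failing sbatch parameters above).
sbatch -J Linux_Desktop -C dcv2 --wrap "hostname"
```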
nicolaven commented 3 years ago

You need to use the proper service to launch the desktop. The built-in one is not appropriate. We are working on building a repository of EF services.

sean-smith commented 3 years ago

Is there any documentation on what the correct service is?

mirneshalilovic commented 1 year ago

@sean-smith @nicolaven I have the same problem. Could you please share an example here of how to start an interactive service?

Thanks in advance.

nicolaven commented 1 year ago

Hi @mirneshalilovic, thanks for your request. Can you try deploying a new cluster using the latest version of 1Click-HPC (a few updates have been released just recently)? Then log into EF and import the following test service: https://github.com/aws-samples/1click-hpc/blob/main/enginframe/ef-services.Linux%20Desktop.2022-11-10T12-18-39.zip

Thanks

mirneshalilovic commented 1 year ago

Hi @nicolaven, thanks for the information and for the updated service. It's working, but only for the dcv queue; I can't run an interactive session with GPU enabled. I did the last deployment last night.

With the command `vdi.launch.session --queue dcv`, sessions can be launched and I can connect.

When I specify `vdi.launch.session --queue dcv-gpu --submitopts "-C g4dn.2xlarge"`, the session stays in a pending state and no machine is started in the background.

In the slurmctld logs I can see this:


```
[2022-11-10T14:08:44.974] sched: Allocate JobId=58 NodeList=dcv-gpu-dy-g4dn-2xlarge-1 #CPUs=1 Partition=dcv-gpu
[2022-11-10T14:09:03.687] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: (Code:VcpuLimitExceeded)Failure when resuming nodes
[2022-11-10T14:09:03.687] requeue job JobId=58 due to failure of node dcv-gpu-dy-g4dn-2xlarge-1
[2022-11-10T14:09:03.688] Requeuing JobId=58
[2022-11-10T14:09:03.688] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 state set to DOWN
[2022-11-10T14:09:03.706] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2022-11-10T14:09:03.706] error: slurm_set_addr: Unable to resolve "dcv-gpu-dy-g4dn-2xlarge-1"
[2022-11-10T14:09:03.706] error: slurm_set_addr: Unable to resolve "dcv-gpu-dy-g4dn-2xlarge-1"
[2022-11-10T14:09:03.706] error: fwd_tree_thread: can't find address for host dcv-gpu-dy-g4dn-2xlarge-1, check slurm.conf
[2022-11-10T14:10:01.590] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: Scheduler health check failed
[2022-11-10T14:10:01.590] powering down node dcv-gpu-dy-g4dn-2xlarge-1
```

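The decisive entry is the node reason, `(Code:VcpuLimitExceeded)`: the EC2 launch attempted by ParallelCluster's node-resume logic was rejected because the account's vCPU quota for G-family On-Demand instances was exhausted, so the dynamic node never started and the job was requeued. A short way to confirm this from the head node (a sketch; the job ID and node name are taken from the log above):

```bash
# Why is the job still PENDING? The REASON column shows the scheduler's view.
squeue -j 58 -o "%.8i %.12T %.40R"

# Reasons recorded for nodes that are DOWN or DRAINED;
# this is where (Code:VcpuLimitExceeded) shows up.
sinfo -R

# Full record for the dynamic node, including its State and Reason fields.
scontrol show node dcv-gpu-dy-g4dn-2xlarge-1
```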
nicolaven commented 1 year ago

you need to request a limit increase for g* instances. Then you should be fine.
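For reference, the relevant limit is the EC2 "Running On-Demand G and VT instances" vCPU quota, which can be raised from the Service Quotas console or via the AWS CLI. A sketch, assuming the usual quota code for that limit (verify it in the console before submitting):

```bash
# Check the current vCPU quota for On-Demand G and VT instances
# (L-DB2E81BA is assumed to be the quota code for this limit).
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-DB2E81BA

# Request an increase: a g4dn.2xlarge uses 8 vCPUs, so the value should be
# at least 8 times the number of concurrent GPU desktops expected.
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --desired-value 32
```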

mirneshalilovic commented 1 year ago

@nicolaven Thank you very much for your quick reply and help. I managed to solve this.