aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

awsbatch based GPU (CUDA) job failed because of missing GPU #1878

Open trsludwig opened 4 years ago

trsludwig commented 4 years ago

Environment:

Bug description and how to reproduce: After creating an awsbatch-based GPU environment, the job cannot run as intended because no GPU can be found. AWS Batch creates an ECS container with a GPU-based AMI, but the job searches for a free GPU and does not find one.

Here is the log of the job:

| 2020-07-03T17:45:35.949+02:00 | Job id: 334ea99b-755c-4660-9baa-2981aa1b56e4:0
-- | --
  | 2020-07-03T17:45:35.949+02:00 | Initializing the environment...
  | 2020-07-03T17:45:35.949+02:00 | Starting ssh agents...
  | 2020-07-03T17:45:35.956+02:00 | Agent pid 7
  | 2020-07-03T17:45:35.961+02:00 | Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
  | 2020-07-03T17:45:36.452+02:00 | Mounting /home...
  | 2020-07-03T17:45:36.547+02:00 | Mounting shared file system...
  | 2020-07-03T17:45:36.631+02:00 | Starting the job...
  | 2020-07-03T17:45:36.633+02:00 | /root
  | 2020-07-03T17:45:36.634+02:00 | Fri Jul 3 15:45:36 UTC 2020
  | 2020-07-03T17:45:36.635+02:00 | ls: cannot access /usr/lib/libXt*: No such file or directory
  | 2020-07-03T17:45:36.637+02:00 | ls: cannot access /usr/lib64/libXt*: No such file or directory
  | 2020-07-03T17:45:36.759+02:00 | Loaded plugins: ovl, priorities
  | 2020-07-03T17:45:39.538+02:00 | Resolving Dependencies
  | 2020-07-03T17:45:39.539+02:00 | --> Running transaction check
  | 2020-07-03T17:45:39.539+02:00 | ---> Package libXt.x86_64 0:1.1.5-3.amzn2.0.2 will be installed
  | 2020-07-03T17:45:39.547+02:00 | --> Processing Dependency: libSM.so.6()(64bit) for package: libXt-1.1.5-3.amzn2.0.2.x86_64
  | 2020-07-03T17:45:39.691+02:00 | --> Processing Dependency: libICE.so.6()(64bit) for package: libXt-1.1.5-3.amzn2.0.2.x86_64
  | 2020-07-03T17:45:39.693+02:00 | --> Running transaction check
  | 2020-07-03T17:45:39.693+02:00 | ---> Package libICE.x86_64 0:1.0.9-9.amzn2.0.2 will be installed
  | 2020-07-03T17:45:39.695+02:00 | ---> Package libSM.x86_64 0:1.2.2-2.amzn2.0.2 will be installed
  | 2020-07-03T17:45:39.839+02:00 | --> Finished Dependency Resolution
  | 2020-07-03T17:45:39.850+02:00 | Dependencies Resolved
  | 2020-07-03T17:45:39.851+02:00 | ================================================================================
  | 2020-07-03T17:45:39.851+02:00 | Package Arch Version Repository Size
  | 2020-07-03T17:45:39.851+02:00 | ================================================================================
  | 2020-07-03T17:45:39.851+02:00 | Installing:
  | 2020-07-03T17:45:39.851+02:00 | libXt x86_64 1.1.5-3.amzn2.0.2 amzn2-core 177 k
  | 2020-07-03T17:45:39.851+02:00 | Installing for dependencies:
  | 2020-07-03T17:45:39.851+02:00 | libICE x86_64 1.0.9-9.amzn2.0.2 amzn2-core 67 k
  | 2020-07-03T17:45:39.851+02:00 | libSM x86_64 1.2.2-2.amzn2.0.2 amzn2-core 39 k
  | 2020-07-03T17:45:39.851+02:00 | Transaction Summary
  | 2020-07-03T17:45:39.851+02:00 | ================================================================================
  | 2020-07-03T17:45:39.851+02:00 | Install 1 Package (+2 Dependent packages)
  | 2020-07-03T17:45:39.851+02:00 | Total download size: 284 k
  | 2020-07-03T17:45:39.851+02:00 | Installed size: 656 k
  | 2020-07-03T17:45:39.852+02:00 | Downloading packages:
  | 2020-07-03T17:45:39.955+02:00 | --------------------------------------------------------------------------------
  | 2020-07-03T17:45:39.955+02:00 | Total 2.7 MB/s \| 284 kB 00:00
  | 2020-07-03T17:45:39.961+02:00 | Running transaction check
  | 2020-07-03T17:45:40.104+02:00 | Running transaction test
  | 2020-07-03T17:45:40.114+02:00 | Transaction test succeeded
  | 2020-07-03T17:45:40.115+02:00 | Running transaction
  | 2020-07-03T17:45:40.191+02:00 | Installing : libICE-1.0.9-9.amzn2.0.2.x86_64 1/3
  | 2020-07-03T17:45:40.235+02:00 | Installing : libSM-1.2.2-2.amzn2.0.2.x86_64 2/3
  | 2020-07-03T17:45:40.262+02:00 | Installing : libXt-1.1.5-3.amzn2.0.2.x86_64 3/3
  | 2020-07-03T17:45:40.275+02:00 | Verifying : libSM-1.2.2-2.amzn2.0.2.x86_64 1/3
  | 2020-07-03T17:45:40.283+02:00 | Verifying : libICE-1.0.9-9.amzn2.0.2.x86_64 2/3
  | 2020-07-03T17:45:40.322+02:00 | Verifying : libXt-1.1.5-3.amzn2.0.2.x86_64 3/3
  | 2020-07-03T17:45:40.322+02:00 | Installed:
  | 2020-07-03T17:45:40.322+02:00 | libXt.x86_64 0:1.1.5-3.amzn2.0.2
  | 2020-07-03T17:45:40.322+02:00 | Dependency Installed:
  | 2020-07-03T17:45:40.322+02:00 | libICE.x86_64 0:1.0.9-9.amzn2.0.2 libSM.x86_64 0:1.2.2-2.amzn2.0.2
  | 2020-07-03T17:45:40.322+02:00 | Complete!
  | 2020-07-03T17:45:40.343+02:00 | ls: cannot access /usr/lib/libXt*: No such file or directory
  | 2020-07-03T17:45:40.345+02:00 | /usr/lib64/libXt.so.6
  | 2020-07-03T17:45:40.345+02:00 | /usr/lib64/libXt.so.6.0.0
  | 2020-07-03T17:45:40.345+02:00 | /home/ec2-user/cluster/fq_arrycontrol_tp40460826_5111_4a74_9603_11d6566c7c28.mat
  | 2020-07-03T17:45:40.345+02:00 | 1
  | 2020-07-03T17:45:48.115+02:00 | HawkSpex(R)Flow - Analytics Workflow
  | 2020-07-03T17:45:48.115+02:00 | Version 1.5-cecdccf4bf72cc35356263026a63aceb6d862540
  | 2020-07-03T17:45:48.115+02:00 | \n
  | 2020-07-03T17:45:48.213+02:00 | remoteArrayJob fargFile /home/ec2-user/cluster/fq_arrycontrol_tp40460826_5111_4a74_9603_11d6566c7c28.mat jobCounter 1
  | 2020-07-03T17:45:48.220+02:00 | Load Data from /home/ec2-user/cluster/fq_arrycontrol_tp40460826_5111_4a74_9603_11d6566c7c28.mat
  | 2020-07-03T17:45:48.244+02:00 | Starting Job with Number: 1
  | 2020-07-03T17:45:48.246+02:00 | /home/ec2-user/cluster/fq_train_model_00001_tpd815fbed_6466_4b49_8f97_6296dea89e3a.mat
  | 2020-07-03T17:45:48.883+02:00 | Test if one GPU is free of 0
  | 2020-07-03T17:45:48.883+02:00 | INFO Param wait4GPU: 5 min
  | 2020-07-03T17:45:48.883+02:00 | INFO Param maxGPUload: 35 %
  | 2020-07-03T17:45:48.886+02:00 | INFO Param maxGPURAMload: 50 %
  | 2020-07-03T17:45:48.887+02:00 | Using CPU!
  | 2020-07-03T17:45:49.905+02:00 |  
  | 2020-07-03T17:45:49.905+02:00 | Computing Resources:
  | 2020-07-03T17:45:49.909+02:00 | MEX on GLNXA64
  | 2020-07-03T17:45:49.909+02:00 |  
  | 2020-07-03T17:45:50.204+02:00 | Finish MLP training
  | 2020-07-03T17:45:50.205+02:00 | Model NNET:MLP learnt with average 0.022353 sec per iteration
  | 2020-07-03T17:45:52.911+02:00 | Fri Jul 3 15:45:52 UTC 2020

As you can see in the last part of the log, our script looks for a free GPU and cannot find one ("Test if one GPU is free of 0"). As a result, the script reports that it is using the CPU instead ("Using CPU!").
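To narrow down whether the container sees any GPU at all, a small diagnostic along the following lines could be run inside the job. This is only a sketch, not the logic of the script that produced the log above: it assumes `nvidia-smi` is available in the container image, the function names are made up for illustration, and the two thresholds simply mirror the `maxGPUload` and `maxGPURAMload` values printed in the log.

```python
import shutil
import subprocess

# Thresholds copied from the job log ("maxGPUload: 35 %", "maxGPURAMload: 50 %").
MAX_GPU_LOAD = 35.0
MAX_GPU_RAM_LOAD = 50.0

# Standard nvidia-smi query flags; one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse nvidia-smi CSV rows into (index, load %, memory load %) tuples."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, load, used, total = [field.strip() for field in line.split(",")]
        rows.append((int(index), float(load), 100.0 * float(used) / float(total)))
    return rows


def free_gpus(rows, max_load=MAX_GPU_LOAD, max_ram=MAX_GPU_RAM_LOAD):
    """Return the indices of GPUs whose load and memory use are under the limits."""
    return [i for i, load, ram in rows if load <= max_load and ram <= max_ram]


def visible_gpus():
    """Return parsed GPU rows, or an empty list if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not installed in the image
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []  # driver missing or no device exposed to the container
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    rows = visible_gpus()
    print(f"GPUs visible in this container: {len(rows)}")
    print(f"Free GPUs: {free_gpus(rows)}")
```

If this reports zero visible GPUs, the problem is that no device is exposed to the container, rather than that the GPUs are busy.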

Do I have to set a special parameter to use aws-parallelcluster for an awsbatch-based CUDA GPU environment? I can't find any documentation or manuals on how to create a GPU-based cluster via aws-parallelcluster with awsbatch as the scheduler, or on the best configuration for it.

The pcluster config is the following:

[aws]
aws_region_name = eu-central-1

[global]
cluster_template = default
sanity_check = true
update_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = XXXXXXXXXXXXXXXXXXXXX
base_os = alinux2
scheduler = awsbatch
master_instance_type = m5.large
desired_vcpus = 4
post_install = https://XXXXXXXXXXXXXXXXX.s3.amazonaws.com/post_install.sh
master_root_volume_size = 200
extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }
vpc_settings = default
s3_read_resource = *

[cluster GPU]
key_name = XXXXXXXXXXXXXXXXXXXXXX
base_os = alinux2
scheduler = awsbatch
compute_instance_type = p3.2xlarge
master_instance_type = m5.large
desired_vcpus = 4
post_install = https://XXXXXXXXXXXXXXXXXXXXXXX.s3.amazonaws.com/post_install-gpu.sh
master_root_volume_size = 200
extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }
vpc_settings = default
s3_read_resource = *

[vpc default]
vpc_id = vpc-XXXXXXXXXXXX
master_subnet_id = subnet-XXXXXXXXX
compute_subnet_id = subnet-XXXXXXXX
additional_sg = sg-XXXXXXXXX

rexcsn commented 4 years ago

Hi @trsludwig,

Unfortunately, GPU options are not currently supported for the ParallelCluster-AWSBatch integration. I will mark this issue as a feature request. To run GPU jobs using the AWS Batch console or CLI directly, please see the official AWS Batch documentation.
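For anyone submitting to AWS Batch directly, a GPU request might look roughly like the sketch below. It uses the boto3 `submit_job` call with a `resourceRequirements` entry of type `GPU`; the helper names and the job name, queue, and job definition values are placeholders, and it assumes the queue is backed by a compute environment with GPU instance types.

```python
def build_gpu_job_request(job_name, job_queue, job_definition, command, gpus=1):
    """Build keyword arguments for AWS Batch submit_job that request GPUs."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": list(command),
            # AWS Batch expects the GPU count as a string value.
            "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
        },
    }


def submit_gpu_job(request, region_name="eu-central-1"):
    """Submit the request and return the job id (requires boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("batch", region_name=region_name)
    return client.submit_job(**request)["jobId"]


if __name__ == "__main__":
    # Placeholder names; replace with an existing queue and job definition.
    request = build_gpu_job_request(
        "gpu-smoke-test", "my-gpu-queue", "my-gpu-job-def", ["nvidia-smi"]
    )
    print(request)
```

Running `nvidia-smi` as the job command is a cheap way to confirm that the container actually sees a GPU before submitting the real workload.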

If you are still interested in using ParallelCluster for GPU workflows, please check out ParallelCluster's integration with Slurm, which does support specifying GPU options in job submission commands.
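As a rough illustration of the Slurm route, a GPU request is passed on the `sbatch` command line with the generic resource flag `--gres=gpu:N`. The helper below only assembles that command; the function name, script name, and job name are hypothetical, and it assumes a Slurm cluster whose compute nodes are GPU instances.

```python
import shlex


def build_sbatch_command(script, gpus=1, job_name=None):
    """Assemble an sbatch command line that requests GPUs via --gres."""
    cmd = ["sbatch", f"--gres=gpu:{gpus}"]
    if job_name:
        cmd.append(f"--job-name={job_name}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical script and job name, printed as a shell-safe command line.
    command = build_sbatch_command("train_model.sh", gpus=1, job_name="mlp-train")
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on the head node, or the printed line can be pasted into a shell.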

Hope that helps!

trsludwig commented 4 years ago

Hi @rexcsn,

Thank you for your quick response.

Could you please state in the official documentation that the awsbatch scheduler cannot process jobs with GPU support, and include a reference to the Slurm scheduler there?

I hope that you will implement this enhancement very soon, since AWS Batch itself has the ability to run GPU-based jobs.

Many thanks in advance and stay healthy!

Best regards, Sebastian