aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

awsbatch: add support for ubuntu1804 #1585

Open microbioticajon opened 4 years ago

microbioticajon commented 4 years ago

Environment:

`[global]
cluster_template = default
update_check = true
sanity_check = true

[aws]
aws_region_name = eu-west-2

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[vpc bioinformatics]
vpc_id = vpc-####
master_subnet_id = subnet-076####
compute_subnet_id = subnet-076####
ssh_from = 172.3.3.10/32
additional_sg = sg-####
use_public_ips = false

[cluster default]
key_name = ###
master_instance_type = t2.medium
compute_instance_type = t2.micro
scheduler = awsbatch
cluster_type = ondemand
post_install = s3://####/aws-parallel-cluster-config/post_install.sh
post_install_args = "172.3.3.0/24"
ephemeral_dir = /scratch
base_os = ubuntu1804
s3_read_resource = arn:aws:s3::####/aws-parallel-cluster-config/*
ebs_settings = data01,software
raid_settings = scratch
vpc_settings = bioinformatics

[ebs software]
shared_dir = /nfs/software
volume_type = gp2
volume_size = 100

[ebs data01]
shared_dir = /data01
volume_type = gp2
volume_size = 20

[raid scratch]
shared_dir = /raid_scratch
raid_type = 0
num_of_raid_volumes = 4
volume_type = gp2
volume_size = 250`

Bug description and how to reproduce:

Hi Guys,

Could someone have a quick look at a problem I'm having?

pcluster is failing in CodeBuild when creating an awsbatch cluster with base_os = ubuntu1804. It is very likely an IAM permissions problem at my end, but the CodeBuild logs suggest that the ubuntu1804 Docker image is missing. I'm assuming that it is not looking in my own ECR, yet the logs show an attempt to log in to ECR (possibly for push purposes?).

There are no errors reported in /var/log/cfn-init-cmd.log on the master node.

Any help would be appreciated. Best, J

Additional context: Have a look at my CodeBuild logs for details:

`[Container] 2020/01/08 16:07:50 Waiting for agent ping
[Container] 2020/01/08 16:07:52 Waiting for DOWNLOAD_SOURCE
[Container] 2020/01/08 16:07:52 Phase is DOWNLOAD_SOURCE
[Container] 2020/01/08 16:07:52 CODEBUILD_SRC_DIR=/codebuild/output/src871444249/src
[Container] 2020/01/08 16:07:52 YAML location is /codebuild/output/src871444249/src/buildspec.yml
[Container] 2020/01/08 16:07:52 No commands found for phase name: INSTALL
[Container] 2020/01/08 16:07:52 Processing environment variables
[Container] 2020/01/08 16:07:52 Moving to directory /codebuild/output/src871444249/src
[Container] 2020/01/08 16:07:52 Registering with agent
[Container] 2020/01/08 16:07:52 Phases found in YAML: 4
[Container] 2020/01/08 16:07:52 BUILD: 4 commands
[Container] 2020/01/08 16:07:52 POST_BUILD: 3 commands
[Container] 2020/01/08 16:07:52 INSTALL: 0 commands
[Container] 2020/01/08 16:07:52 PRE_BUILD: 2 commands
[Container] 2020/01/08 16:07:52 Phase complete: DOWNLOAD_SOURCE State: SUCCEEDED
[Container] 2020/01/08 16:07:52 Phase context status code: Message:
[Container] 2020/01/08 16:07:52 Entering phase INSTALL
[Container] 2020/01/08 16:07:52 Running command echo "Installing Docker version 18 ..."
Installing Docker version 18 ...
[Container] 2020/01/08 16:07:52 Phase complete: INSTALL State: SUCCEEDED
[Container] 2020/01/08 16:07:52 Phase context status code: Message:
[Container] 2020/01/08 16:07:52 Entering phase PRE_BUILD
[Container] 2020/01/08 16:07:52 Running command echo Logging in to Amazon ECR...
Logging in to Amazon ECR...
[Container] 2020/01/08 16:07:52 Running command $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
[Container] 2020/01/08 16:07:56 Phase complete: PRE_BUILD State: SUCCEEDED
[Container] 2020/01/08 16:07:56 Phase context status code: Message:
[Container] 2020/01/08 16:07:56 Entering phase BUILD
[Container] 2020/01/08 16:07:56 Running command echo Build started on date
Build started on Wed Jan 8 16:07:56 UTC 2020
[Container] 2020/01/08 16:07:56 Running command echo Building the Docker images...
Building the Docker images...
[Container] 2020/01/08 16:07:56 Running command sh ./build-docker-images.sh
Building image ubuntu1804
Dockerfile not found for image ubuntu1804. Exiting...
[Container] 2020/01/08 16:07:56 Command did not exit successfully sh ./build-docker-images.sh exit status 1
[Container] 2020/01/08 16:07:56 Phase complete: BUILD State: FAILED
[Container] 2020/01/08 16:07:56 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: sh ./build-docker-images.sh. Reason: exit status 1
[Container] 2020/01/08 16:07:56 Entering phase POST_BUILD
[Container] 2020/01/08 16:07:56 Running command if [ $CODEBUILD_BUILD_SUCCEEDING = 0 ]; then echo Build failed; exit 1; fi
Build failed
[Container] 2020/01/08 16:07:56 Command did not exit successfully if [ $CODEBUILD_BUILD_SUCCEEDING = 0 ]; then echo Build failed; exit 1; fi exit status 1
[Container] 2020/01/08 16:07:56 Phase complete: POST_BUILD State: FAILED
[Container] 2020/01/08 16:07:56 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: if [ $CODEBUILD_BUILD_SUCCEEDING = 0 ]; then echo Build failed; exit 1; fi. Reason: exit status 1`
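For anyone skimming a log like this, the failing phases can be pulled out mechanically. This is only a rough sketch that assumes the `Phase complete: <NAME> State: <STATE>` wording seen in the log above; the function name `failed_phases` is made up for illustration and is not part of pcluster or CodeBuild.

```python
import re

# Matches the "Phase complete: BUILD State: FAILED" entries seen in the log above.
PHASE_RE = re.compile(r"Phase complete: (\w+) State: (\w+)")


def failed_phases(log_text):
    """Return the names of phases whose final state is not SUCCEEDED, in log order."""
    return [
        phase
        for phase, state in PHASE_RE.findall(log_text)
        if state != "SUCCEEDED"
    ]


if __name__ == "__main__":
    sample = (
        "[Container] 2020/01/08 16:07:56 Phase complete: PRE_BUILD State: SUCCEEDED\n"
        "[Container] 2020/01/08 16:07:56 Phase complete: BUILD State: FAILED\n"
        "[Container] 2020/01/08 16:07:56 Phase complete: POST_BUILD State: FAILED\n"
    )
    print(failed_phases(sample))
```

For the log above, the first phase reported would be BUILD, which is where `build-docker-images.sh` stops with "Dockerfile not found for image ubuntu1804".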

enrico-usai commented 4 years ago

Hi @microbioticajon

Currently, when using awsbatch as the scheduler, the only supported operating system is alinux.

Thank you for reporting this. I'm going to mark it as a bug because we need to add validation of the configuration parameters to detect this case before cluster creation.
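To illustrate the kind of check meant here, below is a minimal sketch and not the actual pcluster validator. It assumes the INI layout from the report, and the names `SUPPORTED_BASE_OS` and `validate_base_os` are hypothetical. It also assumes that a missing `base_os` falls back to `alinux`.

```python
import configparser

# Assumption taken from this thread: awsbatch supports only alinux.
# Schedulers not listed here are not checked by this sketch.
SUPPORTED_BASE_OS = {"awsbatch": {"alinux"}}


def validate_base_os(config_text, cluster_section="cluster default"):
    """Return a list of error messages; an empty list means the check passed."""
    # Interpolation is disabled so that values containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    cluster = parser[cluster_section]

    scheduler = cluster.get("scheduler")
    # Assumption: an unset base_os is treated as alinux.
    base_os = cluster.get("base_os", fallback="alinux")

    allowed = SUPPORTED_BASE_OS.get(scheduler)
    if allowed is not None and base_os not in allowed:
        return [
            f"scheduler '{scheduler}' supports only base_os in "
            f"{sorted(allowed)}, got '{base_os}'"
        ]
    return []


if __name__ == "__main__":
    reported = "[cluster default]\nscheduler = awsbatch\nbase_os = ubuntu1804\n"
    for error in validate_base_os(reported):
        print("ERROR:", error)
```

Running a check like this before any resources are created would report the unsupported combination up front, instead of it surfacing later as a failed CodeBuild job.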

Why did you select Ubuntu 18.04? Would it be OK for you to use Amazon Linux as the operating system?

microbioticajon commented 4 years ago

Thanks @enrico-usai

I must have missed that bit about alinux in the docs - apologies.

It's mainly because our current system is based and validated on Ubuntu 18.04. All our build scripts are also geared for compiling with apt, we have a lot of tools to install, and we wanted to maintain some consistency if we moved to using parallel-cluster. I'm sure we would hit problems with missing system libraries in the ECR containers, but I was going to investigate those once I was up and running :-)

It is not an insurmountable problem; it will just require a bit more work at our end if we want to use Batch. Batch would be nicer as it provides a heterogeneous compute environment for when we have large memory jobs.

Best, J

enrico-usai commented 4 years ago

Thanks for describing your use case.

I removed the bug label, since I have fixed the validator and the documentation, and added the enhancement label to keep track of your request.