Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples that build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License

Unable to create a cluster out of an HPC Image derived from a VHD - package epel-release is not installed epel-release-7-11.noarch #461

Open souvik-de opened 3 years ago

souvik-de commented 3 years ago

Describe the bug We have a pipeline that tests a CentOS VHD. The pipeline downloads the VHD into a storage account and then creates an image from it. This image is then fed into the azhpc scripts to deploy a cluster and run benchmarks. Before December 2020 we never had an issue doing this. Now azhpc-build fails at the install_node_setup.sh step with the message "package epel-release is not installed epel-release-7-11.noarch".

To Reproduce Steps to reproduce the behavior:

  1. Have a CentOS-HPC VHD at your disposal.
  2. Download it into a storage account and create an image from it.
  3. Utilize the azhpc scripts and the image to deploy a cluster.
  4. You should encounter the error here.
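
Step 2 above can be sketched with the Azure CLI. The snippet below is only an illustration: it assembles `az storage blob copy start` and `az image create` invocations as argument lists without running them. Every resource group, storage account, container, image name, and URL in it is a hypothetical placeholder, not a value from this report, and storage authentication options are left out.

```python
from typing import List


def copy_vhd_command(source_uri: str, account: str, container: str, blob: str) -> List[str]:
    """Build the az command that copies a VHD into a storage account container."""
    return [
        "az", "storage", "blob", "copy", "start",
        "--source-uri", source_uri,
        "--account-name", account,
        "--destination-container", container,
        "--destination-blob", blob,
    ]


def create_image_command(resource_group: str, image_name: str, vhd_url: str) -> List[str]:
    """Build the az command that creates a managed Linux image from a VHD blob."""
    return [
        "az", "image", "create",
        "--resource-group", resource_group,
        "--name", image_name,
        "--source", vhd_url,
        "--os-type", "Linux",
    ]


# Placeholder values (assumptions for illustration only).
copy_cmd = copy_vhd_command(
    "https://example.invalid/centos-hpc.vhd", "examplestorage", "vhds", "centos-hpc.vhd"
)
image_cmd = create_image_command(
    "example-rg",
    "centos-hpc-test",
    "https://examplestorage.blob.core.windows.net/vhds/centos-hpc.vhd",
)

# Print the commands for review instead of executing them.
print(" ".join(copy_cmd))
print(" ".join(image_cmd))
```

The resulting image would then be referenced from the azhpc configuration used in step 3; how exactly that is done depends on the config file in use and is not shown here.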

Expected behavior As it did before December 2020, azhpc-build should be able to deploy a cluster from the image.

Screenshots image


xpillons commented 3 years ago

@edwardsp can you please have a look?

edwardsp commented 3 years ago

@souvik-de this is just failing because you are unable to ssh from the jumpbox to the compute instance. Have you tried to access the VMSS instance yourself (since you are able to connect to the jumpbox)? Also, does this happen consistently or just occasionally?
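
One way to try that access check by hand from the jumpbox is a non-interactive ssh probe. The following is a minimal sketch, not part of azhpc: the helper names, the user name, and the address are hypothetical. It uses only standard OpenSSH client options (`-v` for verbose output, `BatchMode=yes` to fail instead of prompting, `ConnectTimeout` to avoid hanging).

```python
import subprocess
from typing import List, Optional


def ssh_probe_command(user: str, host: str, key_path: Optional[str] = None) -> List[str]:
    """Build a non-interactive, verbose ssh command that runs `hostname` remotely."""
    cmd = ["ssh", "-v", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10"]
    if key_path:
        # Use an explicit private key instead of the default identities.
        cmd += ["-i", key_path]
    cmd += [f"{user}@{host}", "hostname"]
    return cmd


def run_probe(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the probe; the verbose authentication log ends up in stderr."""
    return subprocess.run(cmd, capture_output=True, text=True)


# Hypothetical user and private IP of a compute instance.
probe = ssh_probe_command("hpcadmin", "10.0.0.5")
print(" ".join(probe))
```

Calling `run_probe(probe)` on the jumpbox and reading `stderr` shows which keys were offered and which authentication methods the instance accepted, which helps separate a key problem from a network problem.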

souvik-de commented 3 years ago

I cannot ssh into the headnode, even after a password reset: "Permission denied (publickey,gssapi-keyex,gssapi-with-mic) | Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)". This happens consistently.
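
The parenthesized list in an OpenSSH "Permission denied" message names the authentication methods the server was still willing to accept when the attempt failed. As a small sketch for comparing the two messages quoted above (the function name is hypothetical, and treating the first message as "before the reset" and the second as "after" is an assumption):

```python
import re
from typing import List

# Captures the comma-separated method list inside "Permission denied (...)".
_DENIED = re.compile(r"Permission denied \(([^)]*)\)")


def offered_auth_methods(message: str) -> List[List[str]]:
    """Return one list of authentication methods per 'Permission denied' entry."""
    return [match.split(",") for match in _DENIED.findall(message)]


msg = (
    "Permission denied (publickey,gssapi-keyex,gssapi-with-mic) | "
    "Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)"
)

before, after = offered_auth_methods(msg)
# Methods present only in the second message.
print(sorted(set(after) - set(before)))  # ['password']
```

Under that assumption, the only difference is that the second attempt was also offered `password` authentication, yet the login was still refused. This suggests that neither the key nor the new password was accepted for that user.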