JiaweiZhuang / cloud-gchp-paper

Code to reproduce GCHP-on-cloud paper
MIT License

pcluster upgrade #7

Open tarankalra1 opened 4 years ago

tarankalra1 commented 4 years ago

Hi Jiawei,

We followed your instructions to set up AWS ParallelCluster, and the cluster was built successfully. In our first attempt, the pcluster configuration did not include "enable_efa = compute".

Then we tried to upgrade the cluster from the command line, using "pcluster upgrade clustername".

The upgrade failed with errors saying that a network interface was "in use"; the message included the network interface ID.

We then stopped the compute nodes using the "pcluster stop" command and shut down the master node through the AWS console, so at this point both the master and the compute nodes were stopped.

Even after that, pcluster upgrade showed the same "in use" error. We also see the interface listed as "in use" under Network Interfaces in the EC2 console.

Do you have any idea why the upgrade would not go through?

rsignell-usgs commented 4 years ago

Our error is:

(aws) C:\Users\rsignell\.parallelcluster>pcluster status ghost
Status: UPDATE_ROLLBACK_COMPLETE
2020-04-03 17:06:57.017000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 213b1c78-2298-45bd-9033-53be181bbc20)
2020-04-03 17:04:57.711000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 7968c1f3-25df-4efb-8829-383057279b28)
2020-04-03 16:59:37.508000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 1b844e51-bfd0-48b5-a76e-82070a3332cb)

and indeed, when we check the AWS console, we find that this network interface is "in use" -- by the Master node, which is stopped!

Perhaps the problem is that we have been stopping the Master node via the console?

Is there a cleaner way to "stop" the entire cluster (not just the compute nodes)?
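For anyone comparing their own output against ours, here is a minimal Python sketch that pulls the network interface ID and EC2 error code out of event lines like the ones above. The regular expressions and the function name `parse_events` are only assumptions based on the three lines we pasted, not a documented pcluster output format.

```python
import re

# Sample copied from the `pcluster status` output quoted above.
SAMPLE = """\
Status: UPDATE_ROLLBACK_COMPLETE
2020-04-03 17:06:57.017000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 213b1c78-2298-45bd-9033-53be181bbc20)
2020-04-03 17:04:57.711000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 7968c1f3-25df-4efb-8829-383057279b28)
2020-04-03 16:59:37.508000+00:00 UPDATE_FAILED AWS::EC2::Instance MasterServer Interface: [eni-0d3d659d9ffbb32ff] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 1b844e51-bfd0-48b5-a76e-82070a3332cb)
"""

# Assumed layout of one event line:
#   <date> <time> <STATUS> <AWS::resource::type> <logical id> <message>
EVENT_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<status>[A-Z_]+) "
    r"(?P<resource_type>AWS::\S+) (?P<logical_id>\S+) (?P<message>.*)$"
)
ENI_RE = re.compile(r"\[(eni-[0-9a-f]+)\]")
ERROR_CODE_RE = re.compile(r"Error Code: ([\w.]+)")


def parse_events(text):
    """Return one dict per event line; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # e.g. the "Status: ..." summary line
        event = match.groupdict()
        eni = ENI_RE.search(event["message"])
        code = ERROR_CODE_RE.search(event["message"])
        event["eni"] = eni.group(1) if eni else None
        event["error_code"] = code.group(1) if code else None
        events.append(event)
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event["timestamp"], event["logical_id"], event["eni"], event["error_code"])
```

Every failed event in our output points at the same interface and the same error code, which is what made us look at that interface in the console.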

JiaweiZhuang commented 4 years ago

Then we tried to upgrade the cluster from the command line, using "pcluster upgrade clustername".

For this kind of major change, I would create a new cluster instead of using pcluster upgrade. The Spack directory can be copied to the new cluster.
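As a rough sketch of the copy step, assuming you run it on the old master node and push to the new one over SSH with rsync: the helper `build_rsync_command`, the host name `ec2-user@new-master`, and the path `/home/ec2-user/spack` below are all hypothetical placeholders, so substitute your own. Keeping the same absolute path on the new cluster is probably the safer choice, since Spack-built binaries tend to embed their install paths.

```python
import shlex
import subprocess


def build_rsync_command(src_dir, dest_host, dest_dir, key_file=None):
    """Build an rsync argument list that mirrors src_dir into dest_host:dest_dir."""
    ssh_cmd = "ssh"
    if key_file:
        # Optional private key for the new master node (hypothetical path).
        ssh_cmd += " -i " + shlex.quote(key_file)
    return [
        "rsync",
        "-az",                      # archive mode, compress during transfer
        "-e", ssh_cmd,              # run rsync over SSH
        src_dir.rstrip("/") + "/",  # trailing slash: copy the directory contents
        f"{dest_host}:{dest_dir.rstrip('/')}/",
    ]


if __name__ == "__main__":
    # Placeholder values, not taken from the paper's setup.
    cmd = build_rsync_command(
        "/home/ec2-user/spack",
        "ec2-user@new-master",
        "/home/ec2-user/spack",
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # Uncomment to actually run the copy:
    # subprocess.run(cmd, check=True)
```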

Is there a cleaner way to "stop" the entire cluster (not just the compute nodes)?

I also stop the master instance in the console. Based on aws/aws-parallelcluster#1053, I think this is a valid approach.

For pcluster issues, it's probably better to post at https://github.com/aws/aws-parallelcluster/issues to get more official answers 😃

rsignell-usgs commented 4 years ago

Thanks @JiaweiZhuang. Good to know we weren't just doing something dumb. Knowing that, we will ask over on the ParallelCluster issues.