clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Can't access graviton nodes... #33

Closed harrywaugh closed 3 years ago

harrywaugh commented 3 years ago

Hi,

I've set up CITC following your docs, https://cluster-in-the-cloud.readthedocs.io/en/latest/infrastructure.html, but I can't seem to allocate any jobs on the Graviton 2 instance.

These were the commands I ran:

Local machine

 git clone https://github.com/clusterinthecloud/installer.git
 cd installer

 ./install-citc.sh aws

 scp -i citc-terraform-trusting-boa/citc-key ~/.ssh/mac.pub citc@34.254.239.9:

CITC@MGMT

 ssh -i citc-terraform-trusting-boa/citc-key citc@34.254.239.9
 echo "c6g.metal: 1" > limits.yaml
 finish # Waited for setup to finish
 sudo /usr/local/sbin/add_user_ldap harrywaugh Harry Waugh file:///home/citc/mac.pub
 sudo /usr/local/bin/run-packer aarch64
 exit # Let script finish before exiting.

On harrywaugh@MGMT

 ssh harrywaugh@34.254.239.9
 srun --cpus-per-task 64 --time 8:00:00 --pty bash
 #    srun: error: Node failure on trusting-boa-c6g-metal-0001
 #    srun: Force Terminated job 2
 #    srun: error: Job allocation 2 has been revoked
 # In AWS instances trusting-boa-c6g-metal-0001 is running and 2/2 status checks have passed
 srun --cpus-per-task 64 --time 8:00:00 --pty bash
 #    srun: Required node not available (down, drained or reserved)
 #    srun: job 3 queued and waiting for resources
 #    Hangs..
 # In AWS instances trusting-boa-c6g-metal-0001 is shutting down.
[harrywaugh@mgmt ~]$ list_nodes
NODELIST                                STATE       REASON                        CPUS S:C:T   MEMORY    AVAIL_FEATURES                          GRES                NODE_ADDR           TIMESTAMP
trusting-boa-c6g-metal-0001             down~       ResumeTimeout reached         64   1:64:1  127133    

sosreport-mgmt-harrywaugh-2021-01-26-wosyqag.tar.gz

milliams commented 3 years ago

The first thing to check is whether the node was actually started. Have a look in the AWS EC2 console and see whether your trusting-boa-c6g-metal-0001 instance was ever created. Most likely it was, but it took longer to start than the default time-out allows. If the instance is there in the console, then run:

[citc@mgmt ~]$ sudo scontrol update nodename=trusting-boa-c6g-metal-0001 state=resume
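
If you'd rather check Slurm's view of the node before resuming it, something like the following could work. This is only a rough sketch: `is_down` is a hypothetical helper (not part of CitC), the node name is the one from this thread, and it assumes `sinfo` and `scontrol` are on the path on the mgmt node.

```shell
#!/usr/bin/env bash
# Sketch: resume a Slurm node only if Slurm has it marked as down.
# The default node name is taken from this thread; pass another as $1.
NODE="${1:-trusting-boa-c6g-metal-0001}"

# Hypothetical helper: succeeds when a Slurm state string
# (e.g. "down~" or "down*") starts with "down".
is_down() {
    case "$1" in
        down*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Only talk to Slurm where its tools exist (i.e. on the mgmt node).
if command -v sinfo >/dev/null 2>&1; then
    # -h: no header, -n: restrict to this node, -o "%T": print the node state
    state="$(sinfo -h -n "$NODE" -o "%T" | head -n 1)"
    echo "$NODE is in state: $state"
    if is_down "$state"; then
        sudo scontrol update nodename="$NODE" state=resume
    fi
fi
```

Run it as the citc user, optionally passing a different node name as the first argument. It does nothing if the node is not in a down state.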

harrywaugh commented 3 years ago

Ah fantastic! Thanks Matt. I'd only tried the POWER_UP and POWER_DOWN status codes... 🤦