clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Error changing node type #48

Closed by verdurin 4 years ago

verdurin commented 4 years ago

When running finish:

Error: Could not find shape information for 'n1-standard-8'.

This was on an already provisioned cluster which I was hoping to change. This may not be a supported use-case?

I wanted to use this machine type because it is recommended for Filestore clients:

https://cloud.google.com/filestore/docs/performance#client-machine

christopheredsall commented 4 years ago

Workaround

Yes, changing shapes (machine types in Google terms) is supported. The process is, as you tried: change the limits.yaml file and re-run finish.

Unfortunately, the list of shapes is currently hardcoded in google-cloud-platform/files/shapes.yaml. At cluster creation time it is copied onto the management node as /etc/citc/shapes.yaml.

So a workaround is to edit that file on the management node:

[provisioner@mgmt ~]$ sudo vim /etc/citc/shapes.yaml

and add a block for the missing shape, for example:

n1-standard-8:
  memory: 29000
  cores_per_socket: 4
  threads_per_core: 2

Then rerun finish:

[provisioner@mgmt ~]$ finish

This rewrites the node specifications in /mnt/shared/etc/slurm/slurm.conf. When a compute node comes up, the Slurm controller checks those specifications against what the Slurm daemon on the node reports, to confirm that the node has sufficient resources.
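
To make the link between a shapes.yaml entry and a Slurm node definition concrete, here is a minimal Python sketch. It is an assumption about what such a mapping could look like, not the code that finish actually runs: the function name, the node name, and the single-socket default are hypothetical. RealMemory, Sockets, CoresPerSocket and ThreadsPerCore are standard slurm.conf node parameters.

```python
# Illustrative sketch only: NOT the actual CitC implementation.
# slurm_node_line and its arguments are hypothetical names.

def slurm_node_line(node_name, shape, sockets=1):
    """Build a slurm.conf-style node definition from one shapes.yaml entry.

    `shape` is a dict with the keys used in the block above:
    memory (MB), cores_per_socket and threads_per_core.
    The socket count is assumed to be 1 unless stated otherwise.
    """
    cores = shape["cores_per_socket"]
    threads = shape["threads_per_core"]
    cpus = sockets * cores * threads  # logical CPUs seen by Slurm
    return (
        f"NodeName={node_name} CPUs={cpus} RealMemory={shape['memory']} "
        f"Sockets={sockets} CoresPerSocket={cores} ThreadsPerCore={threads}"
    )


if __name__ == "__main__":
    n1_standard_8 = {"memory": 29000, "cores_per_socket": 4, "threads_per_core": 2}
    print(slurm_node_line("n1-standard-8-0001", n1_standard_8))
```

If the values in the definition exceed what the node really has, Slurm will typically refuse to bring the node into service, which is why the memory figure is set a little below the nominal size.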

Background

In theory, we should be able to get the required information via an API call like machineTypes.list. For example, with the response saved as types.json:

$ jq '[.items][][] | select (.name=="n1-standard-4" or .name=="n1-standard-8") | {name, memoryMb, guestCpus}' < types.json

Gives:

{
  "name": "n1-standard-4",
  "memoryMb": 15360,
  "guestCpus": 4
}
{
  "name": "n1-standard-8",
  "memoryMb": 30720,
  "guestCpus": 8
}

Whereas a freshly booted n1-standard-4-0001 has

[citc@n1-standard-4-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          14876         364       13669           8         842       14208
Swap:             0           0           0
[citc@n1-standard-4-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s):                4
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1

and on a n1-standard-8-0001

[citc@n1-standard-8-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          29994         527       28621           8         846       29103
Swap:             0           0           0
[citc@n1-standard-8-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1

So we have to "derate" the memory somewhat (~5%) and figure out what the "topology" (threads, cores, sockets) is.

At the moment we are doing this empirically by booting a node and seeing what we get.
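
The comparison above can be reproduced with a short calculation. The following Python sketch is illustrative only and not part of CitC: it computes how far the `free -m` total falls short of the API's memoryMb, and pulls the topology fields out of `lscpu` output. The function names are hypothetical; the numbers are copied from the output quoted above.

```python
# Illustrative sketch only; helper names are hypothetical, not CitC code.

def memory_shortfall_percent(api_memory_mb, observed_total_mb):
    """Percentage by which the booted node's total memory is below the API value."""
    return 100.0 * (api_memory_mb - observed_total_mb) / api_memory_mb


def parse_lscpu_topology(lscpu_output):
    """Extract sockets, cores per socket and threads per core from lscpu text."""
    wanted = {
        "Socket(s)": "sockets",
        "Core(s) per socket": "cores_per_socket",
        "Thread(s) per core": "threads_per_core",
    }
    topology = {}
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            topology[wanted[key.strip()]] = int(value.strip())
    return topology


# lscpu lines as shown for n1-standard-8-0001.
N1_STANDARD_8_LSCPU = """\
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
"""

if __name__ == "__main__":
    print(round(memory_shortfall_percent(15360, 14876), 1))  # n1-standard-4
    print(round(memory_shortfall_percent(30720, 29994), 1))  # n1-standard-8
    print(parse_lscpu_topology(N1_STANDARD_8_LSCPU))
```

For the two nodes shown, the measured shortfall works out to roughly 3.2% and 2.4%. The value of 29000 used in the example shapes.yaml block is about 5.6% below the nominal 30720, which leaves some margin beyond the measured figure.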

christopheredsall commented 4 years ago

I've made pull request #50. If you are blocked in the meantime, you could pull the shapes.yaml out of that.

verdurin commented 4 years ago

@christopheredsall thanks, I also spotted that there was a branch addressing this, after I posted the issue. I've tried the PR and it does work, as expected.