verdurin closed this issue 4 years ago
Yes, changing shapes (machine types in Google terms) is supported. The process is, as you tried: change the limits.yaml file and re-run finish.
Unfortunately, the list of shapes is currently hardcoded. It is in google-cloud-platform/files/shapes.yaml. At cluster creation time it gets copied onto the management node as /etc/citc/shapes.yaml.
So a workaround would be to edit the file:
[provisioner@mgmt ~]$ sudo vim /etc/citc/shapes.yaml
add a block, for example:
n1-standard-8:
  memory: 29000
  cores_per_socket: 4
  threads_per_core: 2
and rerun:
[provisioner@mgmt ~]$ finish
This rewrites the node specifications in /mnt/shared/etc/slurm/slurm.conf. When a compute node comes up, the Slurm controller checks those specifications against what the Slurm daemon on the node reports, to confirm that the node has sufficient resources.
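To illustrate why the values in shapes.yaml matter, here is a minimal Python sketch of the kind of check described above. It is a simplified model, not Slurm's or CitC's actual code: the function name `node_meets_spec` and the dictionary layout are hypothetical, chosen to mirror the shapes.yaml keys. The observed values are the ones reported by `free -m` and `lscpu` on an n1-standard-8 further down in this thread.

```python
# Simplified model of the registration check; not Slurm's actual implementation.
# The function name and dict keys are hypothetical and mirror shapes.yaml.

def node_meets_spec(spec, observed):
    """Return True if the node reports at least the configured resources."""
    return (
        observed["memory"] >= spec["memory"]
        and observed["cores_per_socket"] >= spec["cores_per_socket"]
        and observed["threads_per_core"] >= spec["threads_per_core"]
    )

# The block added to shapes.yaml in the workaround above.
spec = {"memory": 29000, "cores_per_socket": 4, "threads_per_core": 2}

# What a booted n1-standard-8 reported via `free -m` and `lscpu`.
observed = {"memory": 29994, "cores_per_socket": 4, "threads_per_core": 2}

# The nominal 30720 MB from the API is more than the node actually has.
nominal = dict(spec, memory=30720)

print(node_meets_spec(spec, observed))     # True
print(node_meets_spec(nominal, observed))  # False
```

Under this model, configuring the nominal API memory figure would make the node look under-resourced, which is why a lower value is used in the file.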
In theory, we should be able to get the required information via an API call like machineTypes.list. For example, with the response saved as types.json:
$ jq '[.items][][] | select (.name=="n1-standard-4" or .name=="n1-standard-8") | {name, memoryMb, guestCpus}' < types.json
Gives:
{
"name": "n1-standard-4",
"memoryMb": 15360,
"guestCpus": 4
}
{
"name": "n1-standard-8",
"memoryMb": 30720,
"guestCpus": 8
}
Whereas a freshly booted n1-standard-4-0001 has:
[citc@n1-standard-4-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          14876         364       13669           8         842       14208
Swap:             0           0           0
[citc@n1-standard-4-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s): 4
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
and on an n1-standard-8-0001:
[citc@n1-standard-8-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          29994         527       28621           8         846       29103
Swap:             0           0           0
[citc@n1-standard-8-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
So we have to "derate" the memory somewhat (~5%) and figure out what the "topology" (threads, cores, sockets) is.
At the moment we are doing this empirically by booting a node and seeing what we get.
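The empirical step could be scripted along these lines. This is a rough Python sketch, not what CitC does: `parse_lscpu` and `build_shape` are hypothetical helpers, the 5% derate is the approximate figure mentioned above, and the output keys simply mirror the shapes.yaml block shown earlier.

```python
def parse_lscpu(text):
    """Extract topology fields from `lscpu` output into a dict of ints."""
    wanted = {
        "Thread(s) per core": "threads_per_core",
        "Core(s) per socket": "cores_per_socket",
        "Socket(s)": "sockets",
    }
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            result[wanted[key.strip()]] = int(value.strip())
    return result


def build_shape(memory_mb, topology, derate_percent=5):
    """Build a shapes.yaml-style entry from the API memory figure and topology."""
    return {
        # Integer arithmetic avoids floating-point rounding surprises.
        "memory": memory_mb * (100 - derate_percent) // 100,
        "cores_per_socket": topology["cores_per_socket"],
        "threads_per_core": topology["threads_per_core"],
    }


# lscpu output as seen on the n1-standard-8 node above.
lscpu_output = """\
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
"""

topology = parse_lscpu(lscpu_output)
shape = build_shape(30720, topology)  # 30720 is memoryMb from machineTypes.list
print(shape)
```

For n1-standard-8 this yields a memory value of 29184, which is below the 29994 MB total that `free -m` reported on the booted node. The topology still has to come from a real node, since the API response shown above only provides guestCpus.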
I've made pull request #50; in the meantime, you could pull the shapes.yaml out of that in case you are blocked.
@christopheredsall thanks, I also spotted that there was a branch addressing this, after I posted the issue. I've tried the PR and it does work, as expected.
When running finish:
Error: Could not find shape information for 'n1-standard-8'.
This was on an already provisioned cluster which I was hoping to change. This may not be a supported use-case?
I wanted to use this machine type because it is recommended for Filestore clients: https://cloud.google.com/filestore/docs/performance#client-machine