clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Specify custom images for compute nodes? #25

Closed: mikeoconnor0308 closed this issue 3 years ago

mikeoconnor0308 commented 5 years ago

Looking at the latest version on master, it's no longer clear how to specify images for the compute nodes.

I'm looking to use GPU nodes. Since the management node and GPU nodes will need to use different images, I need to be able to specify a custom image for the GPU nodes.

milliams commented 5 years ago

We no longer allow arbitrary choice of image for the system as it makes it almost impossible to ensure that things are configured as we need. To this end, we use a fixed release of Oracle Linux on OCI clusters.

However, we still select a GPU image for GPU nodes. This is performed automatically in the start node process.

If this is not working correctly for you and you are not seeing the image selected as you need, please let us know.

mikeoconnor0308 commented 5 years ago

OK, that makes sense.

This may be a naive question, but what's the best way to install software and its dependencies onto the shared file system? Do users have it all added to their relevant paths automatically?

Specifically, I'm going to be adding a precompiled installation of an MD package to the shared file system, and need to ensure any dependencies are also available on all nodes.

I was thinking the easiest thing to do would be to have images with those dependencies installed via yum. Given that's not an option, what's the easiest way to do it?

milliams commented 5 years ago

We're currently working on a better system for letting users build their entire software stack with a tool like EasyBuild, but in the meantime I will add a feature which at least installs Lmod and points it at a directory in the shared filesystem so that compute nodes can access it.

Please update your local checkout of the oci-cluster-terraform repo: we've just pushed a new version, 3, which is where these new features will be added.

I recommend putting all the dependencies you need on the shared filesystem and creating a module file which sets the paths accordingly.

I will update this issue once I've added in the Lmod install.

milliams commented 5 years ago

There's a PR in for Lmod support at ACRC/slurm-ansible-playbook#37. Once that's merged, you'll be able to use module files placed in /mnt/shares/modules/all from all nodes on your cluster.

milliams commented 5 years ago

I've now merged this in, so any newly created cluster will come with Lmod installed. You can then put *.lua module files in /mnt/share/manual_modules/, as described in the Lmod docs.
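
For anyone landing here later, this is a rough sketch of what such a module file could look like, written as a small Python helper that generates it. It is not part of the cluster tooling. The helper names (`render_modulefile`, `write_modulefile`), the `name/version.lua` layout, and the `bin`/`lib` subdirectories are assumptions about how a precompiled package might be laid out. The Lua calls it emits (`help`, `whatis`, `prepend_path`, `pathJoin`) are standard Lmod modulefile functions.

```python
from pathlib import Path


def render_modulefile(name: str, version: str, prefix: str) -> str:
    """Return the text of a minimal Lmod (Lua) modulefile for one package.

    Assumes the package keeps executables in <prefix>/bin and shared
    libraries in <prefix>/lib; adjust for the real install layout.
    """
    lines = [
        f"help([[{name} {version}, installed under {prefix}]])",
        f'whatis("Name: {name}")',
        f'whatis("Version: {version}")',
        f'local prefix = "{prefix}"',
        'prepend_path("PATH", pathJoin(prefix, "bin"))',
        'prepend_path("LD_LIBRARY_PATH", pathJoin(prefix, "lib"))',
    ]
    return "\n".join(lines) + "\n"


def write_modulefile(module_root: str, name: str, version: str, prefix: str) -> Path:
    """Write <module_root>/<name>/<version>.lua and return its path.

    module_root is whatever directory Lmod has been pointed at on the
    shared filesystem (hypothetical argument, not a fixed location).
    """
    target = Path(module_root) / name / f"{version}.lua"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_modulefile(name, version, prefix))
    return target
```

Calling `write_modulefile` with the module directory as `module_root` and the package's install location on the shared filesystem as `prefix` should then let users run `module load <name>/<version>` on any node, provided that directory is on Lmod's module path.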

milliams commented 3 years ago

This has been in a working state for a while now. Also, we now build custom images which can be configured as described at https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html#configuring-node-images