clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Creating Oracle GPU Nodes #40

p-1603 closed 2 years ago

p-1603 commented 3 years ago

I have been trying to create GPU nodes for use on an Oracle CITC cluster. Following the documentation, I have been able to create node images. The Packer configuration file seems to be all.variables.pkr rather than config.json, as the docs state, and I have also had to reformat the oracle-gpu variables as follows:

source "oracle-oci" "oracle-gpu" {
    image_name          = "${var.destination_image_name}-${var.cluster}-GPU-v{{timestamp}}"
    availability_domain = "${var.oracle_availability_domain}"
    base_image_ocid     = "${var.oracle_base_image_ocid_gpu}"
    compartment_ocid    = "${var.oracle_compartment_ocid}"
    shape               = "${var.oracle_shape_gpu}"
    subnet_ocid         = "${var.oracle_subnet_ocid}"
    access_cfg_file     = "${var.oracle_access_cfg_file}"
    key_file            = "${var.oracle_key_file}"
    tags = {
        cluster = var.cluster
    }
    ssh_username = "opc"
}

However, I have also noticed that /etc/citc/shapes.yaml already contains VM GPU shapes. I used one of these to create several nodes, but found that they either do not accept jobs or get stuck at the 'configuring' stage. I was able to successfully submit jobs to one of these nodes after holding and releasing jobs with Slurm control, but have not since been able to replicate this with any others. I would appreciate any help in working out how to create working GPU nodes consistently.
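
For reference, the hold-and-release workaround mentioned above might look roughly like the sketch below. It assumes that "Slurm control" means the scontrol command and its hold and release subcommands. The job ID 1234 and the helper names are hypothetical and do not come from CitC. By default the script only prints the commands rather than running them.

```python
import shlex
import subprocess
from typing import List


def hold_release_commands(job_id: str) -> List[List[str]]:
    """Build the scontrol invocations that hold and then release one job."""
    return [
        ["scontrol", "hold", job_id],
        ["scontrol", "release", job_id],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            # Show the command exactly as it would be typed in a shell.
            print(shlex.join(cmd))
        else:
            # Needs scontrol on PATH, i.e. a node with Slurm installed.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 1234 is a placeholder job ID used for illustration only.
    run(hold_release_commands("1234"))
```

Running it with dry_run=False on the cluster's management node would actually hold and release the job. Whether that reliably unsticks a node is exactly what could not be reproduced here.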

milliams commented 2 years ago

I can reproduce this issue. It's caused by conflicts between some changes that I made and some that Oracle made on their side. I think I know how to resolve it, and once that is done, it will be simpler to use.

milliams commented 2 years ago

CitC now builds just a single image type for all Oracle nodes. If you want to use the GPUs on a GPU node, you'll need to build that into the image. There's information on this at https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html#gpu-nodes