GoogleCloudPlatform / cluster-toolkit

Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.

slurm-gcp-v6-controller / pre-existing-network-storage - '$controller' not added to mounts #2869

Closed: scott-nag closed this issue 1 month ago

scott-nag commented 1 month ago

Describe the bug

Module scripts located in community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/ (develop branch)

I am creating a v6 cluster using pre-existing-network-storage with server_ip set to $controller in the blueprint. However, the startup script fails to mount the storage and times out:

[root@clusterb6b-controller ~]# tail -f /slurm/scripts/setup.log 
run: ['create-munge-key', '-f']
run: ['systemctl', 'restart', 'munge']
Set up network storage
Temporary failure in name resolution, retrying in 1
Temporary failure in name resolution, retrying in 2
Temporary failure in name resolution, retrying in 4
Temporary failure in name resolution, retrying in 8
Temporary failure in name resolution, retrying in 16
Temporary failure in name resolution, retrying in 32
Temporary failure in name resolution, retrying in 64
Temporary failure in name resolution, retrying in 128
Temporary failure in name resolution, retrying in 256
[Errno -2] Name or service not known
Traceback (most recent call last):
  File "/slurm/scripts/setup.py", line 494, in <module>
    main()
  File "/slurm/scripts/setup.py", line 468, in main
    {
  File "/slurm/scripts/setup.py", line 335, in setup_controller
    setup_network_storage(log)
  File "/slurm/scripts/setup_network_storage.py", line 100, in setup_network_storage
    ext_mounts, int_mounts = separate_external_internal_mounts(all_mounts)
  File "/slurm/scripts/setup_network_storage.py", line 91, in separate_external_internal_mounts
    return separate(internal_mount, mounts)
  File "/slurm/scripts/util.py", line 698, in separate
    return reduce(lambda acc, el: acc[pred(el)].append(el) or acc, coll, ([], []))
  File "/slurm/scripts/util.py", line 698, in <lambda>
    return reduce(lambda acc, el: acc[pred(el)].append(el) or acc, coll, ([], []))
  File "/slurm/scripts/setup_network_storage.py", line 88, in internal_mount
    mount_addr = util.host_lookup(server_ip)
  File "/slurm/scripts/util.py", line 687, in wrapper
    raise captured_exc
  File "/slurm/scripts/util.py", line 680, in wrapper
    return f(*args, **kwargs)
  File "/slurm/scripts/util.py", line 1160, in host_lookup
    return socket.gethostbyname(host_name)
socket.gaierror: [Errno -2] Name or service not known
Aborting setup...
run: ['wall', '-n', '*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***']

*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
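
For reference, the failing lookup can be reproduced in isolation; a minimal sketch, given that util.host_lookup() resolves server_ip via socket.gethostbyname() as the traceback shows:

import socket

# The literal placeholder "$controller" is not a resolvable hostname,
# so the lookup raises the same gaierror seen in setup.log.
try:
    socket.gethostbyname("$controller")
except socket.gaierror as exc:
    print(exc)  # [Errno -2] Name or service not known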

I believe the server_ip in these storage mounts should contain the controller hostname instead of the literal $controller, similar to how the second mount correctly shows cluster9f3-controller here:

Resolved network storage mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': '$controller'}, {'server_ip': 'cluster9f3-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': '$controller'}]
Separating external and internal mounts
Checking if mount is internal: {'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': '$controller'}
Temporary failure in name resolution, retrying in 1
Temporary failure in name resolution, retrying in 2
Temporary failure in name resolution, retrying in 4
...

Steps to reproduce

  1. Create a VPC and subnet
  2. Create cluster using the blueprint
  3. Check the instances logs

Expected behavior

Storage should be mounted successfully and setup should not time out.

Actual behavior

Setup times out, as shown in the logs above.

Version (gcluster --version)

gcluster version - not built from official release
Built from 'develop' branch. Commit info: v1.37.1-167-g1d7dc338-dirty
Terraform version: 1.9.3

(tested with Terraform 1.4 too)

Blueprint


blueprint_name: cluster-b6be43f5

vars:
  project_id: ofetest
  deployment_name: cluster-b6be43f5
  region: us-central1
  zone: us-central1-c
  enable_cleanup_compute: True
  enable_bigquery_load: False
  instance_image_custom: True
  labels:
    created_by: testofe-server

deployment_groups:
- group: primary
  modules:
  - source: modules/network/pre-existing-vpc
    kind: terraform
    settings:
      network_name: proper-hound-network
      subnetwork_name: proper-hound-subnet-2
    id: hpc_network

  - source: modules/file-system/pre-existing-network-storage
    kind: terraform
    id: mount_num_1
    settings:
      server_ip: '$controller'
      remote_mount: /opt/cluster
      local_mount: /opt/cluster
      mount_options: defaults,nofail,nosuid
      fs_type: nfs

  - source: modules/file-system/pre-existing-network-storage
    kind: terraform
    id: mount_num_2
    settings:
      server_ip: '$controller'
      remote_mount: /home
      local_mount: /home
      mount_options: defaults,nofail,nosuid
      fs_type: nfs

  - source: community/modules/project/service-account
    kind: terraform
    id: hpc_service_account
    settings:
      project_id: ofetest
      name: sa
      project_roles:
      - compute.instanceAdmin.v1
      - iam.serviceAccountUser
      - monitoring.metricWriter
      - logging.logWriter
      - storage.objectAdmin
      - pubsub.admin
      - compute.securityAdmin
      - iam.serviceAccountAdmin
      - resourcemanager.projectIamAdmin
      - compute.networkAdmin

  - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    kind: terraform
    id: partition_1
    use:
    - partition_1-nodeset
    settings:
      partition_name: batch
      exclusive: True
      resume_timeout: 500

  - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    id: partition_1-nodeset
    use:
    - mount_num_1
    - mount_num_2
    settings:
      bandwidth_tier: platform_default
      subnetwork_self_link: "projects/ofetest/regions/us-central1/subnetworks/proper-hound-subnet-2"
      enable_smt: False
      enable_placement: False
      machine_type: c2-standard-4
      node_count_dynamic_max: 1
      node_count_static: 0
      disk_size_gb: 50
      disk_type: pd-standard

  - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    kind: terraform
    id: slurm_controller
    settings:
      cloud_parameters:
        resume_rate: 0
        resume_timeout: 500
        suspend_rate: 0
        suspend_timeout: 300
        no_comma_params: false
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 120
      service_account_email: $(hpc_service_account.service_account_email)
      service_account_scopes:
        - https://www.googleapis.com/auth/cloud-platform
        - https://www.googleapis.com/auth/monitoring.write
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/devstorage.read_write
        - https://www.googleapis.com/auth/pubsub
      controller_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING CONTROLLER STARTUP"
      compute_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING COMPUTE STARTUP"
      login_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING LOGIN STARTUP"
    use:
    - slurm_login
    - hpc_network
    - partition_1
    - mount_num_1
    - mount_num_2

  - source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    kind: terraform
    id: slurm_login
    settings:
      num_instances: 1
      subnetwork_self_link: "projects/ofetest/regions/us-central1/subnetworks/proper-hound-subnet-2"
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 120
      service_account_email: $(hpc_service_account.service_account_email)
      service_account_scopes:
        - https://www.googleapis.com/auth/cloud-platform
        - https://www.googleapis.com/auth/monitoring.write
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/devstorage.read_write

Output and logs

N/A - the blueprint deploys successfully

Execution environment

Other info

I have added a quick fix to the resolve_network_storage function in setup_network_storage.py (the "for mount in mounts.values()" loop below), since I noticed similar logic handling $controller in util.py:

def resolve_network_storage(nodeset=None):
    """Combine appropriate network_storage fields to a single list"""

    if lkp.instance_role == "compute":
        try:
            nodeset = lkp.node_nodeset()
        except Exception:
            # External nodename, skip lookup
            nodeset = None

    # seed mounts with the default controller mounts
    if cfg.disable_default_mounts:
        default_mounts = []
    else:
        default_mounts = [
            NSDict(
                {
                    "server_ip": lkp.control_addr or lkp.control_host,
                    "remote_mount": str(path),
                    "local_mount": str(path),
                    "fs_type": "nfs",
                    "mount_options": "defaults,hard,intr",
                }
            )
            for path in (
                dirs.home,
                dirs.apps,
            )
        ]

    # create dict of mounts, local_mount: mount_info
    mounts = mounts_by_local(default_mounts)

    # On non-controller instances, entries in network_storage could overwrite
    # default exports from the controller. Be careful, of course
    mounts.update(mounts_by_local(cfg.network_storage))
    if lkp.instance_role in ("login", "controller"):
        mounts.update(mounts_by_local(cfg.login_network_storage))

    if nodeset is not None:
        mounts.update(mounts_by_local(nodeset.network_storage))

    # Replace $controller with the actual hostname in all mounts
    for mount in mounts.values():
        if mount['server_ip'] == '$controller':
            mount['server_ip'] = cfg.slurm_control_host

    return list(mounts.values())
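
As an aside, the substitution could also reuse the same address that the default controller mounts above are seeded with (lkp.control_addr or lkp.control_host); a minimal, untested variant of the loop:

# Sketch only: mirrors the server_ip used for the default controller mounts,
# preferring the controller address when it is set.
controller_host = lkp.control_addr or lkp.control_host
for mount in mounts.values():
    if mount["server_ip"] == "$controller":
        mount["server_ip"] = controller_host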

With this quick fix in place, the startup scripts run and the login and controller nodes come online:

Setting up network storage
Resolving network storage
Resolved network storage mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': 'clusterad2-controller'}, {'server_ip': 'clusterad2-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': 'clusterad2-controller'}]
External mounts: [], Internal mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': 'clusterad2-controller'}, {'server_ip': 'clusterad2-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': 'clusterad2-controller'}]
Instance is controller, using external mounts
Creating backup of fstab
Restoring fstab from backup
Mounting fstab entries
Handling munge mount
About to run custom scripts
Determined custom script directories: [PosixPath('/slurm/custom_scripts/controller.d')]
Collected custom scripts: [PosixPath('/slurm/custom_scripts/controller.d/ghpc_startup.sh')]
Custom scripts to run: /slurm/custom_scripts/(controller.d/ghpc_startup.sh)
Processing script: /slurm/custom_scripts/controller.d/ghpc_startup.sh
Running script ghpc_startup.sh with timeout=300
run: /slurm/custom_scripts/controller.d/ghpc_startup.sh
ghpc_startup.sh returncode=0
stdout=******************************************** CALLING CONTROLLER STARTUP
This is the startup script for the controller on cluster 3

Unfortunately, Slurm still isn't configured correctly, as shown below, so $controller is possibly not being replaced elsewhere in the module as well. Happy to provide more info if required.

[root@clusterad2-controller ~]# srun -p batch hostname
srun: Required node not available (down, drained or reserved)
...
[root@clusterad2-controller ~]# cat /var/log/slurm/slurmdbd.log
...
[2024-08-05T14:38:07.093] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-08-05T14:38:07.093] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2024-08-05T14:39:06.000] SchedulerParameters=bf_continue,salloc_wait_nodes,ignore_prefer_validation
[2024-08-05T14:42:08.506] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=clusterad2-partition1node-0 usec=2040
[2024-08-05T14:42:09.769] _update_job: setting admin_comment to GCP Error: Permission denied on locations/{} (or it may not exist). for JobId=1
[2024-08-05T14:42:09.769] _slurm_rpc_update_job: complete JobId=1 uid=981 usec=112
[2024-08-05T14:42:09.780] update_node: node clusterad2-partition1node-0 reason set to: GCP Error: Permission denied on locations/{} (or it may not exist).
[2024-08-05T14:42:09.780] Killing JobId=1 on failed node clusterad2-partition1node-0
[2024-08-05T14:42:09.780] update_node: node clusterad2-partition1node-0 state set to DOWN
mr0re1 commented 1 month ago

Hi @scott-nag, thank you for reporting! This will be fixed by https://github.com/GoogleCloudPlatform/slurm-gcp/pull/194

scott-nag commented 1 month ago

This is working perfectly now, thank you for the quick fix!

rohitramu commented 1 month ago

https://github.com/GoogleCloudPlatform/slurm-gcp/pull/194 is included in the latest release.