GoogleCloudPlatform / batch-samples


Issues with batch MPI examples #29

Open vsoch opened 1 year ago

vsoch commented 1 year ago

Hi!

I am trying to extend the simple MPI example here to actually run an MPI program, since the example here only runs hostname. I have two examples locally - one is an application we are working on, and the second is a "hello world" example that I fell back to when I hit some issues (it reproduced them). Here is what my job looks like:

name: "projects/llnl-flux/locations/us-central1/jobs/hello-world-mpi-005"
uid: "hello-world-mpi-00-3f853428-1bba-44c60"
task_groups {
  name: "projects/xxxxxxxxxxxxxxxlocations/us-central1/jobs/hello-world-mpi-005/taskGroups/group0"
  task_spec {
    runnables {
      barrier {
        name: "wait-for-setup"
      }
    }
    runnables {
      script {
        text: "bash /mnt/share/hello-world-mpi/setup.sh"
      }
    }
    runnables {
      barrier {
        name: "wait-for-setup"
      }
    }
    runnables {
      script {
        text: "bash /mnt/share/hello-world-mpi/run.sh"
      }
    }
    compute_resource {
      cpu_milli: 1000
      memory_mib: 1000
    }
    max_run_duration {
      seconds: 3600
    }
    max_retry_count: 2
    volumes {
      gcs {
        remote_path: "netmark-experiment-bucket"
      }
      mount_path: "/mnt/share"
    }
  }
  task_count: 4
  parallelism: 4
  task_count_per_node: 1
  require_hosts_file: true
  permissive_ssh: true
}
allocation_policy {
  location {
    allowed_locations: "regions/us-central1"
    allowed_locations: "zones/us-central1-a"
    allowed_locations: "zones/us-central1-b"
    allowed_locations: "zones/us-central1-c"
    allowed_locations: "zones/us-central1-f"
  }
  instances {
    policy {
      machine_type: "c2-standard-16"
      boot_disk {
        image: "projects/cloud-hpc-image-public/global/images/family/hpc-centos-7"
      }
    }
  }
  service_account {
    email: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
  labels {
    key: "batch-job-id"
    value: "hello-world-mpi-005"
  }
}
labels {
  key: "type"
  value: "script"
}
labels {
  key: "mount"
  value: "bucket"
}
labels {
  key: "env"
  value: "testing"
}
status {
  state: QUEUED
  run_duration {
  }
}
create_time {
  seconds: 1684889759
  nanos: 883261744
}
update_time {
  seconds: 1684889759
  nanos: 883261744
}
logs_policy {
  destination: CLOUD_LOGGING
}

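As a sanity check on the compute_resource block above: cpu_milli is in thousandths of a vCPU per task and memory_mib is in MiB, so this spec requests 1 vCPU and ~1 GiB per task - well within a c2-standard-16 at task_count_per_node: 1. A quick sketch of the arithmetic:

```shell
# cpu_milli is thousandths of a vCPU per task: 1000 -> 1 vCPU.
cpu_milli=1000
memory_mib=1000
echo "per-task request: $((cpu_milli / 1000)) vCPU(s), ${memory_mib} MiB"
```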
And here are the setup.sh and run.sh scripts:

setup.sh

#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
sleep $BATCH_TASK_INDEX

# Note that for this family / image, we are root (do not need sudo)
yum update -y && yum install -y cmake gcc tuned ethtool

# This ONLY works on the hpc-* image family images
google_mpi_tuning --nosmt
# google_install_mpi --intel_mpi
google_install_intelmpi --impi_2021

# This is where they are installed to
# ls /opt/intel/mpi/latest/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release

export PATH=/opt/intel/mpi/latest/bin:$PATH

outdir=/mnt/share/hello-world-mpi
mkdir -p ${outdir}
cd ${outdir}

if [ $BATCH_TASK_INDEX = 0 ]; then
    wget -O /tmp/ompi.tar.gz https://docs.it4i.cz/src/ompi/ompi.tar.gz
    cd /tmp
    tar -xzvf ompi.tar.gz
    rm ompi/Makefile
    cp -R ./ompi/* ${outdir}/
    cd ${outdir}/
    ls
    mpicc -g -lmpi -lmpifort hello_c.c -I/opt/intel/mpi/latest/include -I/opt/intel/mpi/2021.8.0/include -L/opt/intel/mpi/2021.8.0/lib/release -L/opt/intel/mpi/2021.8.0/lib -o hello_c
fi

and run.sh

#!/bin/bash
export PATH=/opt/intel/mpi/latest/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mpi/latest/lib:/opt/intel/mpi/latest/lib/release
find /opt/intel -name mpicc

if [ $BATCH_TASK_INDEX = 0 ]; then
  cd /mnt/share/hello-world-mpi
  ls
  mpirun -hostfile $BATCH_HOSTS_FILE -n 4 -ppn 1 -- /mnt/share/hello-world-mpi/hello_c
fi

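For context on the BATCH_TASK_INDEX guard in both scripts: every task in the group runs the same runnables, so the rank-0-only work (downloading, compiling, launching mpirun) is gated on the task index. A minimal sketch of the pattern (Batch injects BATCH_TASK_INDEX into each task's environment; it is defaulted here only so the sketch runs standalone):

```shell
# BATCH_TASK_INDEX is normally set by Batch per task; default to 0 here
# so the sketch is runnable outside of a Batch job.
BATCH_TASK_INDEX=${BATCH_TASK_INDEX:-0}
if [ "$BATCH_TASK_INDEX" = 0 ]; then
    echo "task 0: launch mpirun across the hostfile"
else
    echo "task $BATCH_TASK_INDEX: no mpirun, participates via the hostfile"
fi
```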
It looks like it's compiling OK - I see hello_c - but the error I've hit in both cases with mpirun is something related to hydra and an argument:

(screenshot: mpirun error output referencing the hydra process manager)

It's been really challenging figuring out how all this works - e.g., it took me a hot minute to realize that these Google install commands for MPI are only available on that specific image family, and then it's taken 10+ jobs to find the paths / bins of various things (I'm past my 50th run and still don't have a working example!) :laughing: I have a lot of feedback I'm planning to share, but I would like to get at least one reasonable example working first (and I'd be happy to share it)! For my execution I'm using the Python SDK, so I don't have a config beyond what I posted above. Thanks for the help - looking forward to getting this working!

vsoch commented 1 year ago

heyo! I got everything working - let me know if you are interested in an example here: https://github.com/converged-computing/operator-experiments/tree/main/google/networking/hello-world-mpi. I think this would be important to show folks - the issue is that the install script just prints a source command for vars.sh (it doesn't actually run it), and I suspect many folks will assume it has been sourced and run into hours / days of anguish debugging. :laughing:
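For anyone landing here: the fix described above amounts to actually sourcing Intel MPI's vars.sh in run.sh before calling mpirun, rather than assuming the install step did it. A hedged sketch - the vars.sh path is an assumption based on the /opt/intel/mpi/latest layout used in the scripts above, so it is guarded on existence:

```shell
# Assumed location of Intel MPI's environment script on the hpc-* images;
# guard on existence so the snippet degrades gracefully elsewhere.
MPI_VARS=/opt/intel/mpi/latest/env/vars.sh
if [ -f "$MPI_VARS" ]; then
    source "$MPI_VARS"
    mpirun -hostfile "$BATCH_HOSTS_FILE" -n 4 -ppn 1 /mnt/share/hello-world-mpi/hello_c
else
    echo "Intel MPI vars.sh not found at $MPI_VARS"
fi
```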