GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

[spark-rapids] Update MIG script #1102

Closed SurajAralihalli closed 1 year ago

SurajAralihalli commented 1 year ago

This PR updates the MIG script to use the latest driver installation method and addresses the following issues:

  1. Supported Linux Distros:

    • Continued support for three Linux distro families: Debian (10, 11), Ubuntu (18, 20), and Rocky (8)
    • Adds support for Ubuntu 22 and Rocky 9 (for future Dataproc releases)
  2. Default Driver Version Update:

    • Previous default driver version: 495.29.05, paired with CUDA 11.5.
    • L4 GPUs require driver version 525 or later, so they failed with the existing init script.
    • New default driver version: 535 (535.104.05) on all three operating systems, which adds support for L4 GPUs.
    • Exception: Ubuntu 18 uses CUDA 12.1.1 with driver v530.30.02.
  3. Improved CUDA Driver Installation:

    • Switched from runfiles to the package manager for NVIDIA driver installation on Ubuntu and Debian.
    • Prevents driver failures caused by unexpected kernel updates:
      • Leverages precompiled kernel modules on Rocky
      • Recompiles kernel modules on Debian and Ubuntu when new kernel headers are installed
  4. Systemd Service for Kernel Headers on Debian and Ubuntu:

    • Introduced a systemd unit, install-headers.service.
    • Installs new kernel headers (if any) only after system reboot.
    • Ensures effective recompilation of kernel modules required for NVIDIA drivers.
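
For readers unfamiliar with the pattern in item 4, here is a minimal sketch of how a boot-time header install could be wired up. It is not the code from this PR: the function names, the helper path `/usr/local/sbin/install-headers.sh`, and the unit contents are assumptions made for illustration. Only the unit name `install-headers.service` comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the script from this PR.

# Print a oneshot systemd unit that runs a header-install helper at boot.
# $1: path to the helper script (assumed default, not taken from the PR).
render_install_headers_unit() {
  local helper="${1:-/usr/local/sbin/install-headers.sh}"
  cat <<EOF
[Unit]
Description=Install kernel headers matching the running kernel
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=${helper}
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Print the body of the helper script. On Debian and Ubuntu the header
# package is named linux-headers-<kernel release>, so the helper installs
# the package that matches whatever kernel the node booted into.
render_install_headers_helper() {
  cat <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
apt-get update
apt-get install -y "linux-headers-$(uname -r)"
EOF
}
```

Under these assumptions, an init action would write the output of `render_install_headers_unit` to `/etc/systemd/system/install-headers.service`, write the helper to the chosen path and make it executable, then run `systemctl daemon-reload` and `systemctl enable install-headers.service`. Because the unit is a oneshot that runs on every boot, a node that reboots into a newer kernel picks up matching headers without any manual step.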

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

SurajAralihalli commented 1 year ago

@jayadeep-jayaraman I haven't been able to launch a cluster with A100 on Dataproc (due to limited/no availability) to test the MIG functionality of this script. Is there a CI/CD job that tests this?

cc: @viadea

jayadeep-jayaraman commented 1 year ago

Better to use L4 instances. A100 is very hard to get at the moment.

SurajAralihalli commented 1 year ago

L4 GPUs don't support MIG (https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
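
A quick way to confirm this on a running node is to query the MIG mode through `nvidia-smi`. The sketch below is illustrative only: the function name `mig_status` is hypothetical, and it assumes that GPUs without MIG support report `[N/A]` in the `mig.mode.current` column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one line of output from
#   nvidia-smi --query-gpu=name,mig.mode.current --format=csv,noheader
# Assumption: GPUs without MIG support report "[N/A]" as the MIG mode.
mig_status() {
  local line="$1"
  local mode="${line##*, }"   # keep only the text after the last ", "
  case "$mode" in
    Enabled)  echo "enabled" ;;
    Disabled) echo "supported-but-disabled" ;;
    *)        echo "unsupported" ;;
  esac
}
```

On a node, each line of the query output can be passed to `mig_status` in a `while IFS= read -r line` loop. Anything other than `Enabled` or `Disabled` is treated as unsupported, which also covers a missing or unexpected value.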

jayadeep-jayaraman commented 1 year ago

If I remember correctly, either @nvliyuan or @viadea mentioned that MIG is not a commonly used feature, and the Spark RAPIDS documentation also says that MIG is not recommended. Can we therefore remove this feature?

viadea commented 1 year ago

@jayadeep-jayaraman Let's keep this feature for now.

jayadeep-jayaraman commented 1 year ago

/gcbrun

jayadeep-jayaraman commented 1 year ago

The tests have passed; merging this change.