Blog: https://github.com/ObrienlabsDev/machine-learning/issues/13 Branch: https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/tree/747-gpu

Configurations

H100

GH100, 80G, 16896 cores, 528 tc, 5120 bus, 1680gbps, 700w, CUDA 9.0

gcloud compute instances create instance-20240227-001029 --project=cuda-old --zone=europe-west4-b --machine-type=a3-highgpu-8g --network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=8,type=nvidia-h100-80gb --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=instance-20240227-001029,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20240128-debian-11-py310,mode=rw,size=200,type=projects/cuda-old/zones/europe-west4-b/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Amsterdam

The GPUS-ALL-REGIONS-per-project quota maximum has been exceeded. Current limit: 4.0. Metric: compute.googleapis.com/gpus_all_regions. More dimension(s): global=global

Thank you for submitting Case # (ID:7a243377658e40e2bd) to Google Cloud Platform support for the following quota:
Change GPUs (all regions) from 4 to 8

Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history.

L4

`VM`	`GPU`	'cards`	region/zone
g2-standard-8	L4	1	us-central-1a
g2-standard-24	L4	2	us-east4-c

1) The A3 VM is the first VM with the Intel 200mbps IPU (not using NVidia's Infiniband)

https://www.intel.com/content/www/us/en/products/details/network-io/ipu.html from https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus

Note: if the google supplied deep learning images based off Debian are not used - check your MTU settings - thanks Henry https://www.civo.com/learn/fixing-networking-for-docker

2) Start with prebuilt linux images from NVidia by CUDA version

https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/655
```
FROM nvidia/cuda:12.2.0-runtime-ubi9
```
or go directly to a TensorFlow capable image
```
FROM tensorflow/tensorflow:latest-gpu
```

Start with L4 (more available) via g2-standard-8 and move up to H100

do a parallel request for capacity of at least 1

Spin up a single and multi GPU L4 instance to test the tensorflow DockerFile with single and parallel strategy objects

Single

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

Multi

leave out the arg or specify

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"]

Quota

refer to quota exercise for L4 in https://github.com/ObrienlabsDev/blog/wiki/CUDA-based-%E2%80%90-High-Performance-Computing-%E2%80%90-LLM-Training-%E2%80%90-Ground-to-GCP-Cloud-Hybrid#get-additional-gpu-quota

quota is a part of a 2 step process - after your quota is approved, the zone may not have resources to service your request - retry or move zones
keep alternate zones around - the best strategy is to keep your VM/GPU up
quota exercise for V100/A100/H100 will involve your sales rep - try the following form to ask for H100 in parallel
https://docs.google.com/forms/d/e/1FAIpQLSfWP2weHCBj9AliES43_TA0LO4oOaP5sbGDWWPSbe-NaBuxJA/viewform
for example

L4 prototype for H100 prep - on us-east4-c using DL image

(base) michael@l4-2:~$ nvidia-smi
Thu Nov 30 19:35:58 2023       
-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   60C    P0    32W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA L4           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    31W /  72W |      0MiB / 23034MiB |      7%      Default |
|                               |                      |                  N/A |

TensforFlow / Keras test ML training run

Run a standard concurrent saturation TensorFlow/Keras ML job from U of Toronto to check batch size optimums under 30 epochs to get close to 1.0 fitness - 25 avoids overfit

https://github.com/ObrienlabsDev/machine-learning

base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py 
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
# https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
# https://keras.io/api/models/model/
  parallel_model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)
  loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
# https://keras.io/api/models/model_training_apis/
  parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)#5120)#7168)#7168)

(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile 
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

base) michael@l4-4-2:~/machine-learning$ ./build.sh 
Sending build context to Docker daemon  6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow

uccessfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory:  -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory:  -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30

023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893      80 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$

Batch = 2048, epochs = 25
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300

Procedure to bootstrap cuda/tensorflow on GCE VM

boot up a VM with 1 or more GPUs. In this case an g2-standard-24 linux VM with 2 L4 GPUs

CUDA drivers need to be installed

wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
#sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda

michael@l4-2a:~$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-alternative : Depends: glx-alternative-nvidia (>= 1.0) but it is not installable
E: Unable to correct problems, you have held broken packages.

rerun
michael@l4-2a:~$ sudo add-apt-repository contrib
sudo: add-apt-repository: command not found

michael@l4-2a:~$ sudo apt install software-properties-common

https://linuxhint.com/fix-sudo-add-apt-repository-command-not-found-error-linux-ubuntu/

hanging - but will take 5 min to complete

Processing triggers for dbus (1.12.28-0+deb11u1) ...

fixed
michael@l4-2a:~$ sudo add-apt-repository contrib
'contrib' distribution component enabled for all sources.

Rerun apt-get

sudo apt-get update
sudo apt-get -y install cuda

get a larger SSD
 cannot copy extracted data for './usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so' to '/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so.dpkg-new': failed to write (No space left on device)
Selecting previously unselected package cuda-cudart-dev-12-1.
Preparing to unpack .../196-cuda-cudart-dev-12-1_12.1.55-1_amd64.deb ...
dpkg: unrecoverable fatal error, aborting:
 unable to flush /var/lib/dpkg/updates/tmp.i after padding: No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)

michael@l4-2a:~$ df
Filesystem      1K-blocks     Used Available Use% Mounted on
udev             49433116        0  49433116   0% /dev
tmpfs             9888784      476   9888308   1% /run
/dev/nvme0n1p1   10089736 10036060         0 100% /
tmpfs            49443920        0  49443920   0% /dev/shm
tmpfs                5120        0      5120   0% /run/lock
/dev/nvme0n1p15    126678    10900    115778   9% /boot/efi
tmpfs             9888784        0   9888784   0% /run/user/1000

remove the image
rw-r--r-- 1 michael michael 3247734826 Feb 23  2023 cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb

/dev/nvme0n1p1   10089736 6864436   2691184  72% /

sudo dpkg --configure -a
update-initramfs: Generating /boot/initrd.img-5.10.0-26-cloud-amd64
Errors were encountered while processing:
 cuda-libraries-12-1
 cuda-runtime-12-1
 cuda-drivers-530
 cuda-drivers

michael@l4-2a:~$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 cuda : Depends: cuda-12-1 (>= 12.1.0) but it is not going to be installed
 cuda-cudart-dev-12-1 : Depends: cuda-cccl-12-1 but it is not going to be installed

sudo apt --fix-broken install

Setting up libnvidia-opticalflow1:amd64 (530.30.02-1) ...
Setting up libnvidia-encode1:amd64 (530.30.02-1) ...
Setting up cuda-drivers-530 (530.30.02-1) ...
Setting up cuda-drivers (530.30.02-1) ...
Setting up cuda-runtime-12-1 (12.1.0-1) ...
Processing triggers for glx-alternative-nvidia (1.2.1~deb11u1) ...
Processing triggers for glx-alternative-mesa (1.2.1~deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u7) ...
Processing triggers for update-glx (1.2.1~deb11u1) ...
Processing triggers for glx-alternative-nvidia (1.2.1~deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u7) ...
Processing triggers for initramfs-tools (0.140) ...
update-initramfs: Generating /boot/initrd.img-5.10.0-26-cloud-amd64

check nvidia-smi

rerun install after install fix

sudo apt-get -y install cuda

dpkg: unrecoverable fatal error, aborting:
 unable to flush /var/lib/dpkg/updates/tmp.i after padding: No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)

10G full again - recreating the VM with an 80G drive
/dev/nvme0n1p1   10089736 10036372         0 100% /

Single or Dual L4

dual 24 cpus / 96g ram

4 cpus / 16g ram

Prototype dual L4 - g2-standard-24 for image generation

https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#console

gcloud compute instances create l4-2b --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-2b,image=projects/debian-cloud/global/images/debian-11-bullseye-v20231115,mode=rw,size=80,type=projects/cuda-old/zones/us-east4-c/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Install NVidia CUDA drivers https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local

with additional

sudo apt install software-properties-common

wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt install software-properties-common
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
- note gui language install step
michael@l4-2b:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

before cuda
michael@l4-2b:~$ df
/dev/nvme0n1p1   82317152 8491284  70355856  11% /

after
/dev/nvme0n1p1   82317152 17862648  60984492  23% /

reboot

https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#script-limitations

sudo reboot now

Check NVidia driver

ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Nov 30 18:28:13 2023 from 35.235.241.34
michael@l4-2b:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

michael@l4-2b:~$ nvidia-settings -q NvidiaDriverVersion
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.

michael@l4-2b:~$ nvidia-settings
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.

michael@l4-2b:~$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory
michael@l4-2b:~$ ubuntu-drivers devices
-bash: ubuntu-drivers: command not found
michael@l4-2b:~$ apt search nvidia-driver
Sorting... Done
Full Text Search... Done
glx-alternative-nvidia/oldstable,now 1.2.1~deb11u1 amd64 [installed,automatic]
  allows the selection of NVIDIA as GLX provider

libegl-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary EGL library

libgl1-nvidia-glvnd-glx/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL/GLX library (GLVND variant)

libgles-nvidia1/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL|ES 1.x library

libgles-nvidia2/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL|ES 2.x library

libglx-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary GLX library

nvidia-alternative/unknown,now 530.30.02-1 amd64 [installed,automatic]
  allows the selection of NVIDIA as GLX provider

nvidia-detect/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA GPU detection utility

nvidia-driver/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA metapackage

nvidia-driver-bin/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA driver support binaries

nvidia-driver-libs/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)

nvidia-driver-libs-i386/unknown 530.30.02-1 i386
  NVIDIA metapackage (OpenGL/GLX/EGL/GLES 32-bit libraries)

nvidia-kernel-dkms/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary kernel module DKMS source

nvidia-kernel-open/unknown 530.30.02-1 amd64
  NVIDIA binary kernel module source

nvidia-kernel-open-dkms/unknown 530.30.02-1 amd64
  NVIDIA binary kernel module DKMS source open flavor

nvidia-kernel-source/unknown 530.30.02-1 amd64
  NVIDIA binary kernel module source

xserver-xorg-video-nvidia/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary Xorg driver

michael@l4-2b:~$ apt search nvidia-driver
Sorting... Done
Full Text Search... Done
glx-alternative-nvidia/oldstable,now 1.2.1~deb11u1 amd64 [installed,automatic]
  allows the selection of NVIDIA as GLX provider

libegl-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary EGL library

libgl1-nvidia-glvnd-glx/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL/GLX library (GLVND variant)

libgles-nvidia1/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL|ES 1.x library

libgles-nvidia2/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary OpenGL|ES 2.x library

libglx-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
  NVIDIA binary GLX library

nvidia-alternative/unknown,now 530.30.02-1 amd64 [installed,automa

check for both cards

michael@l4-2b:~$ lspci -nn | egrep -i "3d|display|vga"
00:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
00:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)

michael@l4-2b:~$ nvidia-detect
Detected NVIDIA GPUs:
00:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
00:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)

Checking card:  NVIDIA Corporation Device 27b8 (rev a1)
Uh oh. Your card is not supported by any driver version up to 530.30.02.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

Checking card:  NVIDIA Corporation Device 27b8 (rev a1)
Uh oh. Your card is not supported by any driver version up to 530.30.02.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

The driver from nvidia is too old The CUDA version from the google page is 12.1 - needs to be 12.3 https://cloud.google.com/compute/docs/gpus/install-drivers-gpu

bump https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local

to 12.3 https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local

switching to managed DL image
<img width="1082" alt="Screenshot 2023-11-30 at 14 23 50" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/39f5e0a9-3860-49bd-bd34-9053e7229c70">

wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-debian11-12-3-local_12.3.1-545.23.08-1_amd64.deb sudo dpkg -i cuda-repo-debian11-12-3-local_12.3.1-545.23.08-1_amd64.deb sudo cp /var/cuda-repo-debian11-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/ sudo add-apt-repository contrib sudo apt-get update sudo apt-get -y install cuda-toolkit-12-3

gcloud compute instances create l4-2e --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-2e,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-east4-c/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Periodically in us-east4c there are issues with 2 L4s - spinning up 1 L4 ok - used the DeepLearning image
say yes to reinstalling
<img width="883" alt="Screenshot 2023-11-30 at 14 32 32" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/93a4894f-ccb2-476c-a4a5-98ebdd81b06e">

DRIVER_GCS_PATH=gs://nvidia-drivers-us-public/tesla/525.105.17 Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17 WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

All good

base) michael@l4-2:~$ nvidia-smi
Thu Nov 30 19:35:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   47C    P0    30W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Dual L4 g2-standard-24 24/96G - running DL image

see https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/747
no need for manual driver install https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local - but follow
https://cloud.google.com/compute/docs/gpus/install-drivers-gpu

supplied drivers : | NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |


michael@cloudshell:~ (clouddeploy-ol)$ gcloud config set project cuda-old
Updated property [core/project].
michael@cloudshell:~ (cuda-old)$ gcloud compute instances create l4-4-2 --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-4-2,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Created [https://www.googleapis.com/compute/v1/projects/cuda-old/zones/us-east4-c/instances/l4-4-2]. NAME: l4-4-2 ZONE: us-east4-c MACHINE_TYPE: g2-standard-24 PREEMPTIBLE: INTERNAL_IP: 10.150.0.10 EXTERNAL_IP: 34. STATUS: RUNNING

ssh

====================================== Welcome to the Google Deep Learning VM

Version: common-gpu.m113 Resources:

Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run: sudo /opt/deeplearning/install-driver.sh Linux l4-4-2 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64

The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.

This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute. Would you like to install the Nvidia driver? [y/n]

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17...... WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

oik



<img width="894" alt="Screenshot 2023-11-30 at 15 00 28" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/dacbd70e-f94a-43f3-a00d-c35505636399">

TensorFlow / Keras ML training run

Run a standard concurrent saturation TensorFlow/Keras ML job from U of Toronto to check batch size optimums under 30 epochs to get close to 1.0 fitness - 25 avoids overfit

https://github.com/ObrienlabsDev/machine-learning

base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py 
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
# https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
# https://keras.io/api/models/model/
  parallel_model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)
  loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
# https://keras.io/api/models/model_training_apis/
  parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)#5120)#7168)#7168)

(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile 
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

base) michael@l4-4-2:~/machine-learning$ ./build.sh 
Sending build context to Docker daemon  6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow

uccessfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory:  -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory:  -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30

023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893      80 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$

Batch = 2048, epochs = 25
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300

VM image with tensorflow ML repo

gcloud beta compute machine-images create l4-2-us-east-1c-w-ml-repo --project=cuda-old --description=l4-2-us-east-1c-w-ml-repo-20231130 --source-instance=l4-4-2 --source-instance-zone=us-east4-c --storage-location=us

Local LLM

Local LLM hosting up to 49G VRAM on 64G or 101G VRAM on 128G Apple Silicon in prep of CSP hosted deployment of inference model https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe

GCP Hosted LLM IaaS - minimum 8 x H100 80G - $65/hr - Nvidia GRID n/a

gcloud compute instances create a100a --project=cuda-old --zone=us-central1-a --machine-type=a3-highgpu-8g --network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=count=8,type=nvidia-h100-80gb --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=a100a,image=projects/debian-cloud/global/images/debian-12-bookworm-v20240110,mode=rw,size=40,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history

The GPUS-ALL-REGIONS-per-project quota maximum has been exceeded. Current limit: 4.0. Metric: compute.googleapis.com/gpus_all_regions. More dimension(s): global=global

Try older A100-40G There is a new inline quota request button on the console - that leverages the automated 1-3 min bot

good new A100 quota is opening up

Your quota request for cuda-old has been approved and your project quota has been adjusted according to the following requested limits:

+------------------+-----------------+----------+-----------------+----------------+
| NAME             | DIMENSIONS      | REGION   | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+-----------------+----------+-----------------+----------------+
| A2_CPUS          | region=us-east1 | us-east1 | 12              | 12             |
|                  |                 |          |                 |                |
| CPUS_ALL_REGIONS |                 | GLOBAL   | 12              | 12             |
|                  |                 |          |                 |                |
| NVIDIA_A100_GPUS | region=us-east1 | us-east1 | 1               | 1              |

gcloud compute instances create a100c --project=cuda-old --zone=us-east1-b --machine-type=a2-highgpu-1g --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=count=1,type=nvidia-tesla-a100 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=a100c,image=projects/debian-cloud/global/images/debian-12-bookworm-v20240110,mode=rw,size=200,type=projects/cuda-old/zones/us-east1-b/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Quota is open but capacity is still pending

A a2-highgpu-1g VM instance with 1 nvidia-tesla-a100 accelerator(s) is currently unavailable in the us-east1-b zone. Alternatively, you can try your request again with a different VM hardware configuration or at a later time

GoogleCloudPlatform / pubsec-declarative-toolkit

Prep for H100 access on new G3 instance with IPU networking for both CUDA and TensorFlow images via quad L4 - start only with IaaS docker pass through before GKE deployment #747