wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
#sudo add-apt-repository contrib   # skipped at first - causes the unmet-dependency failure below
sudo apt-get update
sudo apt-get -y install cuda
michael@l4-2a:~$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-alternative : Depends: glx-alternative-nvidia (>= 1.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
Rerun the skipped contrib step:
michael@l4-2a:~$ sudo add-apt-repository contrib
sudo: add-apt-repository: command not found
michael@l4-2a:~$ sudo apt install software-properties-common
https://linuxhint.com/fix-sudo-add-apt-repository-command-not-found-error-linux-ubuntu/
The install appears to hang, but takes about 5 minutes to complete:
Processing triggers for dbus (1.12.28-0+deb11u1) ...
fixed
michael@l4-2a:~$ sudo add-apt-repository contrib
'contrib' distribution component enabled for all sources.
sudo apt-get update
sudo apt-get -y install cuda
The 10G boot disk fills up - need a larger SSD:
cannot copy extracted data for './usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so' to '/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so.dpkg-new': failed to write (No space left on device)
Selecting previously unselected package cuda-cudart-dev-12-1.
Preparing to unpack .../196-cuda-cudart-dev-12-1_12.1.55-1_amd64.deb ...
dpkg: unrecoverable fatal error, aborting:
unable to flush /var/lib/dpkg/updates/tmp.i after padding: No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)
michael@l4-2a:~$ df
Filesystem 1K-blocks Used Available Use% Mounted on
udev 49433116 0 49433116 0% /dev
tmpfs 9888784 476 9888308 1% /run
/dev/nvme0n1p1 10089736 10036060 0 100% /
tmpfs 49443920 0 49443920 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
/dev/nvme0n1p15 126678 10900 115778 9% /boot/efi
tmpfs 9888784 0 9888784 0% /run/user/1000
Remove the 3.2G installer .deb:
-rw-r--r-- 1 michael michael 3247734826 Feb 23 2023 cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
/dev/nvme0n1p1 10089736 6864436 2691184 72% /
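For the record, the space came back by deleting the installer .deb (its contents already live under /var/cuda-repo-debian11-12-1-local after the dpkg -i above); clearing the apt cache is an additional assumed step that can also help:

rm cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo apt-get clean  # drops cached .debs under /var/cache/apt/archives
df -h /             # confirm the space was reclaimed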
sudo dpkg --configure -a
update-initramfs: Generating /boot/initrd.img-5.10.0-26-cloud-amd64
Errors were encountered while processing:
cuda-libraries-12-1
cuda-runtime-12-1
cuda-drivers-530
cuda-drivers
michael@l4-2a:~$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
cuda : Depends: cuda-12-1 (>= 12.1.0) but it is not going to be installed
cuda-cudart-dev-12-1 : Depends: cuda-cccl-12-1 but it is not going to be installed
sudo apt --fix-broken install
Setting up libnvidia-opticalflow1:amd64 (530.30.02-1) ...
Setting up libnvidia-encode1:amd64 (530.30.02-1) ...
Setting up cuda-drivers-530 (530.30.02-1) ...
Setting up cuda-drivers (530.30.02-1) ...
Setting up cuda-runtime-12-1 (12.1.0-1) ...
Processing triggers for glx-alternative-nvidia (1.2.1~deb11u1) ...
Processing triggers for glx-alternative-mesa (1.2.1~deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u7) ...
Processing triggers for update-glx (1.2.1~deb11u1) ...
Processing triggers for glx-alternative-nvidia (1.2.1~deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u7) ...
Processing triggers for initramfs-tools (0.140) ...
update-initramfs: Generating /boot/initrd.img-5.10.0-26-cloud-amd64
Check nvidia-smi, then rerun the install after the fix:
sudo apt-get -y install cuda
dpkg: unrecoverable fatal error, aborting:
unable to flush /var/lib/dpkg/updates/tmp.i after padding: No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)
The 10G disk is full again - recreating the VM with an 80G drive:
/dev/nvme0n1p1 10089736 10036372 0 100% /
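An in-place alternative to recreating the VM would be growing the boot disk; a hedged sketch, assuming the disk name matches the instance (l4-2a) and the single-root-partition layout shown in the df output:

gcloud compute disks resize l4-2a --size=80GB --zone=us-east4-c
# then on the VM (growpart ships in cloud-guest-utils):
sudo apt-get install -y cloud-guest-utils
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1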
Dual-L4 g2-standard-24: 24 vCPUs / 96G RAM (previous VM: 4 vCPUs / 16G RAM).
https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#console
gcloud compute instances create l4-2b --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-2b,image=projects/debian-cloud/global/images/debian-11-bullseye-v20231115,mode=rw,size=80,type=projects/cuda-old/zones/us-east4-c/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
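To confirm both GPUs actually attached, a quick check against the instance created above:

gcloud compute instances describe l4-2b --zone=us-east4-c --format="value(guestAccelerators)"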
Install the NVIDIA CUDA drivers per https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local - this time with the additional software-properties-common step:
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt install software-properties-common
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
- note: there is an interactive language/keyboard configuration step during the install
michael@l4-2b:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Disk usage before the CUDA install:
michael@l4-2b:~$ df
/dev/nvme0n1p1 82317152 8491284 70355856 11% /
and after:
/dev/nvme0n1p1 82317152 17862648 60984492 23% /
https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#script-limitations
sudo reboot now
Check the NVIDIA driver after the reboot:
michael@l4-2b:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
michael@l4-2b:~$ nvidia-settings -q NvidiaDriverVersion
Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.
michael@l4-2b:~$ nvidia-settings
Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run `nvidia-settings --help` for usage information.
michael@l4-2b:~$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory
michael@l4-2b:~$ ubuntu-drivers devices
-bash: ubuntu-drivers: command not found
michael@l4-2b:~$ apt search nvidia-driver
Sorting... Done
Full Text Search... Done
glx-alternative-nvidia/oldstable,now 1.2.1~deb11u1 amd64 [installed,automatic]
allows the selection of NVIDIA as GLX provider
libegl-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary EGL library
libgl1-nvidia-glvnd-glx/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary OpenGL/GLX library (GLVND variant)
libgles-nvidia1/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary OpenGL|ES 1.x library
libgles-nvidia2/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary OpenGL|ES 2.x library
libglx-nvidia0/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary GLX library
nvidia-alternative/unknown,now 530.30.02-1 amd64 [installed,automatic]
allows the selection of NVIDIA as GLX provider
nvidia-detect/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA GPU detection utility
nvidia-driver/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA metapackage
nvidia-driver-bin/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA driver support binaries
nvidia-driver-libs/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
nvidia-driver-libs-i386/unknown 530.30.02-1 i386
NVIDIA metapackage (OpenGL/GLX/EGL/GLES 32-bit libraries)
nvidia-kernel-dkms/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary kernel module DKMS source
nvidia-kernel-open/unknown 530.30.02-1 amd64
NVIDIA binary kernel module source
nvidia-kernel-open-dkms/unknown 530.30.02-1 amd64
NVIDIA binary kernel module DKMS source open flavor
nvidia-kernel-source/unknown 530.30.02-1 amd64
NVIDIA binary kernel module source
xserver-xorg-video-nvidia/unknown,now 530.30.02-1 amd64 [installed,automatic]
NVIDIA binary Xorg driver
check for both cards
michael@l4-2b:~$ lspci -nn | egrep -i "3d|display|vga"
00:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
00:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
michael@l4-2b:~$ nvidia-detect
Detected NVIDIA GPUs:
00:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
00:04.0 3D controller [0302]: NVIDIA Corporation Device [10de:27b8] (rev a1)
Checking card: NVIDIA Corporation Device 27b8 (rev a1)
Uh oh. Your card is not supported by any driver version up to 530.30.02.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
Checking card: NVIDIA Corporation Device 27b8 (rev a1)
Uh oh. Your card is not supported by any driver version up to 530.30.02.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
The driver from NVIDIA is too old. The CUDA version from the Google page (https://cloud.google.com/compute/docs/gpus/install-drivers-gpu) is 12.1 - it needs to be 12.3.
switching to managed DL image
<img width="1082" alt="Screenshot 2023-11-30 at 14 23 50" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/39f5e0a9-3860-49bd-bd34-9053e7229c70">
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-debian11-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-debian11-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-debian11-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
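A sanity check after the 12.3 install (nvcc lands under the versioned prefix, since /usr/local/cuda may not be on PATH yet):

/usr/local/cuda-12.3/bin/nvcc --version
nvidia-smi  # the 545.x driver should now report CUDA Version: 12.3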
gcloud compute instances create l4-2e --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-2e,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-east4-c/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Periodically in us-east4-c there are capacity issues with 2 L4s (spinning up 1 L4 is OK) - used the Deep Learning image.
say yes to reinstalling
<img width="883" alt="Screenshot 2023-11-30 at 14 32 32" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/93a4894f-ccb2-476c-a4a5-98ebdd81b06e">
(base) michael@l4-2:~$ nvidia-smi
Thu Nov 30 19:35:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 47C P0 30W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
michael@cloudshell:~ (clouddeploy-ol)$ gcloud config set project cuda-old
Updated property [core/project].
michael@cloudshell:~ (cuda-old)$ gcloud compute instances create l4-4-2 --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-4-2,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Created [https://www.googleapis.com/compute/v1/projects/cuda-old/zones/us-east4-c/instances/l4-4-2]. NAME: l4-4-2 ZONE: us-east4-c MACHINE_TYPE: g2-standard-24 PREEMPTIBLE: INTERNAL_IP: 10.150.0.10 EXTERNAL_IP: 34. STATUS: RUNNING
SSH in:
Version: common-gpu.m113
To reinstall the Nvidia driver (if needed) run: sudo /opt/deeplearning/install-driver.sh
Linux l4-4-2 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64
This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute. Would you like to install the Nvidia driver? [y/n]
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17......
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
OK - the image runs a Python (conda base) virtual environment:
(base) michael@l4-4-2:~$ nvidia-smi
Thu Nov 30 19:51:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 60C P0 32W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 57C P0 31W / 72W | 0MiB / 23034MiB | 7% Default |
| | | N/A |
<img width="894" alt="Screenshot 2023-11-30 at 15 00 28" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/dacbd70e-f94a-43f3-a00d-c35505636399">
Run a standard concurrent-saturation TensorFlow/Keras ML job (ResNet50 on the U of Toronto CIFAR-100 dataset) to check batch-size optimums: 30 epochs gets close to 1.0 training accuracy, while stopping at 25 avoids overfitting.
https://github.com/ObrienlabsDev/machine-learning
(base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
    # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
    # https://keras.io/api/models/model/
    parallel_model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100,
    )
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # https://keras.io/api/models/model_training_apis/
    parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)  # also tried 5120, 7168
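Sanity check on the step counts in the logs below: CIFAR-100 has 50,000 training images, so a global batch of 2048 gives ceil(50000 / 2048) = 25 steps per epoch - the 25/25 in the output. MirroredStrategy splits that global batch across the two L4s, so each GPU sees 1024 images per step.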
(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]
(base) michael@l4-4-2:~/machine-learning$ ./build.sh
Sending build context to Docker daemon 6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow
Successfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory: -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory: -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30
2023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893 80 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$
Batch = 2048, epochs = 25
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300
Create a VM machine image with the TensorFlow ML repo:
gcloud beta compute machine-images create l4-2-us-east-1c-w-ml-repo --project=cuda-old --description=l4-2-us-east-1c-w-ml-repo-20231130 --source-instance=l4-4-2 --source-instance-zone=us-east4-c --storage-location=us
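A replacement instance can then be stamped out from that machine image (l4-4-3 is a hypothetical name):

gcloud beta compute instances create l4-4-3 --project=cuda-old --zone=us-east4-c --source-machine-image=l4-2-us-east-1c-w-ml-repo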
Local LLM hosting: up to 49G VRAM on 64G, or 101G VRAM on 128G, Apple Silicon - in prep for CSP-hosted deployment of the inference model: https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe
gcloud compute instances create a100a --project=cuda-old --zone=us-central1-a --machine-type=a3-highgpu-8g --network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=count=8,type=nvidia-h100-80gb --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=a100a,image=projects/debian-cloud/global/images/debian-12-bookworm-v20240110,mode=rw,size=40,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history
The GPUS-ALL-REGIONS-per-project quota maximum has been exceeded. Current limit: 4.0. Metric: compute.googleapis.com/gpus_all_regions. More dimension(s): global=global
Try the older A100-40G. There is a new inline quota-request button on the console that leverages the automated 1-3 minute bot.
Good news: A100 quota is opening up.
Your quota request for cuda-old has been approved and your project quota has been adjusted according to the following requested limits:
+------------------+-----------------+----------+-----------------+----------------+
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+-----------------+----------+-----------------+----------------+
| A2_CPUS | region=us-east1 | us-east1 | 12 | 12 |
| | | | | |
| CPUS_ALL_REGIONS | | GLOBAL | 12 | 12 |
| | | | | |
| NVIDIA_A100_GPUS | region=us-east1 | us-east1 | 1 | 1 |
+------------------+-----------------+----------+-----------------+----------------+
gcloud compute instances create a100c --project=cuda-old --zone=us-east1-b --machine-type=a2-highgpu-1g --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=count=1,type=nvidia-tesla-a100 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=a100c,image=projects/debian-cloud/global/images/debian-12-bookworm-v20240110,mode=rw,size=200,type=projects/cuda-old/zones/us-east1-b/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Quota is open but capacity is still pending
A a2-highgpu-1g VM instance with 1 nvidia-tesla-a100 accelerator(s) is currently unavailable in the us-east1-b zone. Alternatively, you can try your request again with a different VM hardware configuration or at a later time
Blog: https://github.com/ObrienlabsDev/machine-learning/issues/13 Branch: https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/tree/747-gpu
Configurations
H100
Amsterdam (pricing screenshot):

| Item | Cost |
| -- | -- |
| 208 vCPU + 1872 GB memory | $7,251.14 |
| 8 NVIDIA H100 80GB | $74,375.41 |
| 6000 GiB Local SSD disks | $528.00 |
| 10 GB balanced persistent disk | $1.10 |
| Use discount | -$22,312.62 |
| Total | $59,843.03 |
L4
VM
GPU
1) The A3 VM is the first VM with the Intel 200 Gbps IPU (not NVIDIA InfiniBand)
https://www.intel.com/content/www/us/en/products/details/network-io/ipu.html from https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus
Note: if the Google-supplied deep learning images based on Debian are not used, check your MTU settings (sketch below) - thanks Henry: https://www.civo.com/learn/fixing-networking-for-docker
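A minimal sketch of the check/fix (GCP VPCs default to MTU 1460 while Docker assumes 1500; ens4 is the typical GCE NIC name, and the daemon.json edit is the approach from the Civo article):

ip link show ens4 | grep mtu
echo '{ "mtu": 1460 }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker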
2) Start with prebuilt linux images from NVidia by CUDA version
or go directly to a TensorFlow capable image
Start with L4 (more available) via g2-standard-8 and move up to H100
Spin up single and multi-GPU L4 instances to test the TensorFlow Dockerfile with single and parallel strategy objects (see the device-selection sketch below).
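One way to exercise both paths with the same image is restricting which GPUs Docker exposes (tag from the build above; note tflow.py pins two devices, so for the single-GPU run drop the devices argument from MirroredStrategy or use the default strategy):

docker run --rm --gpus '"device=0"' ml-tensorflow-win  # single L4
docker run --rm --gpus all ml-tensorflow-win           # both L4s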
Single
Multi
Quota
Quota is part of a 2-step process: after your quota is approved, the zone may still not have capacity to service your request - retry or move zones (see the zone-listing sketch below)
Keep alternate zones around - the best strategy is to keep your VM/GPU up
The quota exercise for V100/A100/H100 will involve your sales rep - try the following form to ask for H100 in parallel:
https://docs.google.com/forms/d/e/1FAIpQLSfWP2weHCBj9AliES43_TA0LO4oOaP5sbGDWWPSbe-NaBuxJA/viewform
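The zone-shopping mentioned above can be scripted - list the zones that actually offer an accelerator type before retrying:

gcloud compute accelerator-types list --filter="name=nvidia-l4" --format="value(zone)"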
For example: the L4 prototype for H100 prep - on us-east4-c using the DL image.
TensorFlow / Keras test ML training run
Run a standard concurrent-saturation TensorFlow/Keras ML job (ResNet50 on the U of Toronto CIFAR-100 dataset) to check batch-size optimums: 30 epochs gets close to 1.0 training accuracy, while stopping at 25 avoids overfitting.
https://github.com/ObrienlabsDev/machine-learning