obriensystems opened this issue 1 year ago
michael@cloudshell:~ (clouddeploy-ol)$ gcloud config set project cuda-old
Updated property [core/project].
michael@cloudshell:~ (cuda-old)$ gcloud compute instances create l4-4-2 --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-4-2,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Created [https://www.googleapis.com/compute/v1/projects/cuda-old/zones/us-east4-c/instances/l4-4-2]. NAME: l4-4-2 ZONE: us-east4-c MACHINE_TYPE: g2-standard-24 PREEMPTIBLE: INTERNAL_IP: 10.150.0.10 EXTERNAL_IP: 34. STATUS: RUNNING
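Optional: the attached accelerators can be verified without SSH via the Compute Engine API. A minimal sketch (assumes google-api-python-client is installed and application-default credentials are configured; project/zone/instance are the values from the command above):

```python
# Sketch: confirm the GPUs attached to the new VM via the Compute Engine API.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
instance = compute.instances().get(
    project="cuda-old", zone="us-east4-c", instance="l4-4-2"
).execute()
for acc in instance.get("guestAccelerators", []):
    print(acc["acceleratorType"], acc["acceleratorCount"])  # expect nvidia-l4 x 2
```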
SSH in to the VM:
Version: common-gpu.m113
To reinstall Nvidia driver (if needed) run: sudo /opt/deeplearning/install-driver.sh
Linux l4-4-2 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64
The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute. Would you like to install the Nvidia driver? [y/n]
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17...... WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
OK. Running a Python venv:
(base) michael@l4-4-2:~$ nvidia-smi
Thu Nov 30 19:51:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 60C P0 32W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 57C P0 31W / 72W | 0MiB / 23034MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
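Before launching training, it is worth confirming that TensorFlow itself enumerates both L4s (nvidia-smi visibility alone does not guarantee the TF build picked them up). A quick check:

```python
# Sanity check: TensorFlow should list both L4 GPUs on g2-standard-24.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(len(gpus), "GPU(s) visible:", gpus)  # expect 2 here
```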
Run a standard concurrent-saturation TensorFlow/Keras ML job on CIFAR-100 (from U of Toronto) to check batch-size optimums. Under 30 epochs training accuracy gets close to 1.0; 25 epochs avoids overfitting.
https://github.com/ObrienlabsDev/machine-learning
(base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
    # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
    # https://keras.io/api/models/model/
    parallel_model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100,
    )
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # https://keras.io/api/models/model_training_apis/
    parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)  # also tried 5120 and 7168
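Note on batch size: MirroredStrategy splits the global batch across replicas, so batch_size=2048 means 1024 samples per L4 per step. A sketch of deriving the global batch from a per-GPU size instead of hard-coding it (per_replica_batch is an assumed value, not from the original script):

```python
# Sketch: scale the global batch with the replica count so the same
# script works unchanged on 2, 4, or 8 GPUs.
per_replica_batch = 1024  # assumption: what one L4 handles comfortably here
global_batch = per_replica_batch * strategy.num_replicas_in_sync
parallel_model.fit(x_train, y_train, epochs=30, batch_size=global_batch)
```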
(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]
(base) michael@l4-4-2:~/machine-learning$ ./build.sh
Sending build context to Docker daemon 6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow
Successfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory: -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory: -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30
2023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893 80 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$
Batch = 2048, epochs = 25
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300
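The logs above report training accuracy only, so 0.93 at epoch 25 does not by itself show where overfitting starts. Since tflow.py already loads the held-out split, a one-line check of generalization (a sketch, appended after fit()):

```python
# Sketch: measure held-out accuracy to confirm the overfit point.
test_loss, test_acc = parallel_model.evaluate(x_test, y_test, batch_size=2048)
print(f"test accuracy: {test_acc:.4f}")
```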
Using 2 of 4 GPUs:
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])#, "/gpu:2", "/gpu:3"])
parallel_model.fit(x_train, y_train, epochs=25, batch_size=2048)
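Passing an explicit device list to MirroredStrategy is one way to pin training to 2 of the 4 GPUs; an alternative sketch is to hide the other two from TensorFlow before initialization, which should also keep them from holding the ~196MiB contexts visible below:

```python
# Sketch: expose only the first two GPUs to TensorFlow, as an
# alternative to listing devices in MirroredStrategy.
# Must run before any op initializes the GPUs.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[:2], "GPU")
strategy = tf.distribute.MirroredStrategy()  # now sees 2 replicas
```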
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 78C P0 66W / 72W | 21070MiB / 23034MiB | 82% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 77C P0 69W / 72W | 21070MiB / 23034MiB | 78% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L4 Off | 00000000:00:05.0 Off | 0 |
| N/A 64C P0 33W / 72W | 196MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L4 Off | 00000000:00:06.0 Off | 0 |
| N/A 64C P0 31W / 72W | 196MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 15778 C python 21058MiB |
| 1 N/A N/A 15778 C python 21058MiB |
| 2 N/A N/A 15778 C python 184MiB |
| 3 N/A N/A 15778 C python 184MiB |
+---------------------------------------------------------------------------------------+
4 L4s on g2-standard-48 aggregate roughly 80G (comparable aggregate memory to V100/A100/H100 setups, but with a much lower bus width).
More than 2 GPUs hits the same issue as https://github.com/tensorflow/tensorflow/issues/41724
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 71C P0 32W / 72W | 20958MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 71C P0 35W / 72W | 20956MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L4 Off | 00000000:00:05.0 Off | 0 |
| N/A 66C P0 34W / 72W | 20956MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L4 Off | 00000000:00:06.0 Off | 0 |
| N/A 65C P0 31W / 72W | 20956MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 37338 C python 20946MiB |
| 1 N/A N/A 37338 C python 20944MiB |
| 2 N/A N/A 37338 C python 20944MiB |
| 3 N/A N/A 37338 C python 20944MiB |
+---------------------------------------------------------------------------------------+
Epoch 1/25
2023-12-01 01:56:26.358086: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-01 01:56:26.370835: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-01 01:56:26.389974: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-01 01:56:26.407626: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"])
or
strategy = tf.distribute.MirroredStrategy()#devices=["/gpu:0", "/gpu:1"])#, "/gpu:2", "/gpu:3"])
Issues occur with more than 2 GPUs, both on GCP and on an on-prem 3-GPU setup (two RTX-4500s and one RTX-4000).
Training works fine with 2 GPUs.
It hangs on 4 L4s or on the 3-GPU RTX-4500/4500/4000 setup.
https://github.com/tensorflow/tensorflow/issues/41724#issuecomment-665996179
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
parallel_model.fit(x_train, y_train, epochs=25, batch_size=2048)
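Putting that workaround together, a sketch of the changed lines in tflow.py (ReductionToOneDevice funnels gradient reduction through a single device instead of NCCL all-reduce; tf.distribute.HierarchicalCopyAllReduce() is the other cross_device_ops commonly suggested in that thread):

```python
# Sketch: explicit cross-device ops to avoid the >2-GPU hang from
# tensorflow/tensorflow#41724.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice()
)
# alternative: cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
```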
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 80C P0 62W / 72W | 21002MiB / 23034MiB | 58% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 78C P0 67W / 72W | 20994MiB / 23034MiB | 46% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L4 Off | 00000000:00:05.0 Off | 0 |
| N/A 76C P0 67W / 72W | 20998MiB / 23034MiB | 55% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L4 Off | 00000000:00:06.0 Off | 0 |
| N/A 75C P0 51W / 72W | 21002MiB / 23034MiB | 55% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 40306 C python 20990MiB |
| 1 N/A N/A 40306 C python 20982MiB |
| 2 N/A N/A 40306 C python 20986MiB |
| 3 N/A N/A 40306 C python 20990MiB |
+---------------------------------------------------------------------------------------+
Epoch 24/25
25/25 [==============================] - 3s 105ms/step - loss: 0.2089 - accuracy: 0.9445
Epoch 25/25
25/25 [==============================] - 3s 105ms/step - loss: 0.1559 - accuracy: 0.9592
gcloud compute instances create l4-8c --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-96 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=8,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-8c,image=projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231105-debian-11,mode=rw,size=50,type=projects/cuda-old/zones/us-east4-c/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
- Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 4.0 globally.
metric name = compute.googleapis.com/gpus_all_regions
limit name = GPUS-ALL-REGIONS-per-project
limit = 4.0
dimensions = global: global
Thank you for submitting Case # (ID:f...28d) to Google Cloud Platform support for the following quota:
Change GPUs (all regions) from 4 to 8
Add an example workload config for G2 in general, covering CUDA/TensorFlow/Keras/LLM training and inference: https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus
An alternate NVIDIA RTX Virtual Workstation deployment is already working via the Marketplace: https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022
L4 GPUs per G2 VM: g2-standard-4/-8/-12/-16/-32 have 1, g2-standard-24 has 2, g2-standard-48 has 4, g2-standard-96 has 8.
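For sizing reference, a tiny sketch encoding that shape-to-GPU mapping (machine-type names are the standard G2 shapes; the helper function is hypothetical):

```python
# Sketch: pick the smallest G2 machine type for a desired L4 count.
L4_PER_G2 = {
    "g2-standard-4": 1, "g2-standard-8": 1, "g2-standard-12": 1,
    "g2-standard-16": 1, "g2-standard-24": 2, "g2-standard-32": 1,
    "g2-standard-48": 4, "g2-standard-96": 8,
}

def smallest_g2_for(gpus: int) -> str:
    candidates = [m for m, n in L4_PER_G2.items() if n >= gpus]
    return min(candidates, key=L4_PER_G2.get)  # fewest attached GPUs wins

print(smallest_g2_for(2))  # g2-standard-24
```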