ML/AI on various platforms

NVIDIA CUDA image

https://docs.nvidia.com/datacenter/cloud-native/#overview
https://hub.docker.com/r/nvidia/cuda/tags
Dockerfile
```
FROM nvidia/cuda:12.2.0-devel-ubi8
CMD nvidia-smi
```

docker build -t nvidia-smi .
docker run --rm --gpus all nvidia-smi 

On an older Lenovo P53 (Quadro P1000, Pascal GP107):
Mon Oct  9 20:23:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro P1000                   On  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8              N/A /  20W |    543MiB /  4096MiB |      4%      Default |

On a Lenovo P17 Gen 1 (Quadro RTX 5000, Turing TU104):
CUDA Version 12.2.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Mon Oct  9 22:59:17 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 5000                On  | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P8              14W / 110W |   1565MiB / 16384MiB |      4%      Default |

The key to GPU passthrough to Docker is the --gpus flag. If you don't set it, you get the following:

$ docker run --rm nvidia-smi

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/bin/sh: nvidia-smi: command not found
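
The two invocations above differ only in the --gpus flag, which a small wrapper can make explicit. This is a minimal sketch and not part of the original notes: `docker_run_args` is a hypothetical helper name, and `nvidia-smi` is the image tag built earlier. It only assembles the argument list and does not start a container.

```python
import shlex


def docker_run_args(image, gpus="all"):
    """Build the argv for `docker run --rm`, optionally with --gpus.

    Hypothetical helper for illustration; pass gpus=None to reproduce
    the failing invocation that omits the flag.
    """
    args = ["docker", "run", "--rm"]
    if gpus is not None:
        # --gpus asks Docker to expose the host GPUs to the container.
        args += ["--gpus", gpus]
    args.append(image)
    return args


# Print both command lines so they can be compared or pasted into a shell.
print(shlex.join(docker_run_args("nvidia-smi")))
print(shlex.join(docker_run_args("nvidia-smi", gpus=None)))
```

The returned list can be handed to `subprocess.run(args, check=True)` on a host where Docker and the NVIDIA Container Toolkit are installed.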

Ada RTX-3500 on P1 Gen 6 - 202311 - AD104 5120 cores

micha@p1gen6 MINGW64 /c/wse_github/ObrienlabsDev/machine-learning/environments/windows (main)
$ git diff
diff --git a/environments/windows/src/tflow.py b/environments/windows/src/tflow.py
index a661906..bf6ad05 100644
--- a/environments/windows/src/tflow.py
+++ b/environments/windows/src/tflow.py
@@ -12,12 +12,16 @@ import tensorflow as tf
 #print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

 #NUM_GPUS = 2
-#strategy = tf.contrib.distribute.MirroredStrategy()#num_gpus=NUM_GPUS)
+##strategy = tf.contrib.distribute.MirroredStrategy()#num_gpus=NUM_GPUS)
+
 # working on dual RTX-4090
-strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
+#strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
 #WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /replica:0/task:0/device:GPU:1,/replica:0/task:0/device:GPU:0
 #Number of devices: 2

+# Working on Lenovo P1 Gen 6 - 1 3500 GPU but with gpu0 the iris embedded 13900 intel gpu
+#strategy = tf.contrib.distribute.MirroredStrategy(devices=["/gpu:1"])
+strategy = tf.distribute.MirroredStrategy()

 #central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
 #strategy = tf.distribute.MultiWorkerMirroredStrategy() # not in tf 1.5
@@ -61,4 +65,4 @@ with strategy.scope():
   loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
 # https://keras.io/api/models/model_training_apis/
   parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
-parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)#5120)#7168)#7168)
+parallel_model.fit(x_train, y_train, epochs=40, batch_size=2048)#5120)#7168)#7168)

micha@p1gen6 MINGW64 /c/wse_github/ObrienlabsDev/machine-learning/environments/windows (main)
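
The diff above swaps hard-coded device lists in and out for each machine. One alternative, sketched here under the assumption of TensorFlow 2.x, is to pick the strategy at run time from the GPUs TensorFlow can see. The names `strategy_kind` and `make_strategy` are hypothetical and do not appear in tflow.py; this is an illustration, not the approach the repository uses.

```python
def strategy_kind(num_gpus):
    """Map a count of visible GPUs to a strategy label (hypothetical helper)."""
    if num_gpus >= 2:
        return "mirrored"
    if num_gpus == 1:
        return "one_gpu"
    return "cpu"


def make_strategy():
    """Create a tf.distribute strategy for whatever hardware is visible."""
    import tensorflow as tf  # imported lazily so strategy_kind works without TF

    gpus = tf.config.list_physical_devices("GPU")
    kind = strategy_kind(len(gpus))
    if kind == "mirrored":
        # With no devices argument, MirroredStrategy uses all visible GPUs.
        return tf.distribute.MirroredStrategy()
    if kind == "one_gpu":
        return tf.distribute.OneDeviceStrategy(device="/gpu:0")
    return tf.distribute.OneDeviceStrategy(device="/cpu:0")
```

In the script, `strategy = make_strategy()` would replace the commented-out alternatives, and the existing `with strategy.scope():` block would stay unchanged.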

4 min 44 s

Batch size 1024:

parallel_model.fit(x_train, y_train, epochs=40, batch_size=1024)#5120)#7168)#7168

Epoch 25/40
49/49 [==============================] - 5s 107ms/step - loss: 0.6931 - accuracy: 0.8159
Epoch 26/40
49/49 [==============================] - 5s 107ms/step - loss: 0.6341 - accuracy: 0.8317
Epoch 27/40
49/49 [==============================] - 5s 107ms/step - loss: 2.5129 - accuracy: 0.4004
Epoch 28/40
49/49 [==============================] - 5s 107ms/step - loss: 2.4635 - accuracy: 0.3868
Epoch 29/40
49/49 [==============================] - 5s 107ms/step - loss: 2.0855 - accuracy: 0.4571
Epoch 30/40
49/49 [==============================] - 5s 107ms/step - loss: 2.2108 - accuracy: 0.4369
Epoch 31/40
49/49 [==============================] - 5s 108ms/step - loss: 1.5757 - accuracy: 0.5663
Epoch 32/40
49/49 [==============================] - 5s 107ms/step - loss: 1.3286 - accuracy: 0.6448
Epoch 33/40
49/49 [==============================] - 5s 107ms/step - loss: 1.2228 - accuracy: 0.6600
Epoch 34/40
49/49 [==============================] - 5s 107ms/step - loss: 0.6476 - accuracy: 0.8269
Epoch 35/40
49/49 [==============================] - 5s 107ms/step - loss: 0.3602 - accuracy: 0.9112
Epoch 36/40
49/49 [==============================] - 5s 107ms/step - loss: 0.2305 - accuracy: 0.9519
Epoch 37/40
49/49 [==============================] - 5s 107ms/step - loss: 0.1606 - accuracy: 0.9721
Epoch 38/40
49/49 [==============================] - 5s 108ms/step - loss: 0.2328 - accuracy: 0.9497
Epoch 39/40
49/49 [==============================] - 5s 108ms/step - loss: 0.5184 - accuracy: 0.8651
Epoch 40/40
49/49 [==============================] - 5s 108ms/step - loss: 0.2425 - accuracy: 0.9458

CPU i9-13900H laptop

strategy = tf.distribute.OneDeviceStrategy(device="/cpu")
parallel_model.fit(x_train, y_train, epochs=40, batch_size=32)#5120)#7168)#7168)

1501, 32 batch
1504:51 - at 545 ms per step and 196 steps per epoch x 30 epochs = about 50 min

Your kernel may have been built without NUMA support.
2023-11-25 20:01:04.695421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-11-25 20:01:04.695440: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-11-25 20:01:04.695462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9595 MB memory:  -> device: 0, name: NVIDIA RTX 3500 Ada Generation Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 5s 0us/step
2023-11-25 20:01:12.380426: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/30
2023-11-25 20:01:18.817300: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f7da8052cc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-25 20:01:18.817328: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-25 20:01:18.861630: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1700942478.974538      61 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-11-25 20:01:18.977277: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
2023-11-25 20:01:18.978925: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
2023-11-25 20:01:19.199491: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
2023-11-25 20:01:19.201277: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
2023-11-25 20:01:19.204641: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
  3/196 [..............................] - ETA: 1:40 - loss: 6.3933 - accuracy: 0.0130
2023-11-25 20:01:21.324943: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
 40/196 [=====>........................] - ETA: 1:23 - loss: 4.9843 - accuracy: 0.0296
2023-11-25 20:01:41.031136: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
196/196 [==============================] - 114s 545ms/step - loss: 4.2022 - accuracy: 0.0953
Epoch 2/30
196/196 [==============================] - 105s 537ms/step - loss: 3.6075 - accuracy: 0.1890
Epoch 3/30
159/196 [=======================>......] - ETA: 20s - loss: 3.8961 - accuracy: 0.1589
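
The time estimate for the CPU run can be reproduced with a few lines of arithmetic. This is a rough sketch: it assumes the CIFAR-100 training split of 50,000 images and a batch size of 256, which is inferred from the 196 steps per epoch in the log rather than stated in the notes. The function names are illustrative.

```python
import math

CIFAR100_TRAIN_IMAGES = 50_000  # size of the CIFAR-100 training split


def steps_per_epoch(num_samples, batch_size):
    """Keras runs one step per batch; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)


def training_minutes(steps, seconds_per_step, epochs):
    """Total wall-clock estimate in minutes, ignoring start-up overhead."""
    return steps * seconds_per_step * epochs / 60


steps = steps_per_epoch(CIFAR100_TRAIN_IMAGES, 256)  # assumed batch size
print(steps, round(training_minutes(steps, 0.545, 30), 1))
```

The same helper gives 49 steps for batch size 1024 and 13 steps for batch size 4096, matching the step counts in the GPU logs.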

CPU i9-13900KS desktop

i9-13900KS at 6.2 GHz, single RTX A4500 card, Asus Z790 Hero with 1600 W power supply, dual 32 GB RAM at 6400 on XMP I, batch size 4096, 25 epochs

2023-12-29 17:49:12.793423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-12-29 17:49:12.793436: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-29 17:49:12.793447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 17782 MB memory:  -> device: 0, name: NVIDIA RTX A4500, pci bus id: 0000:01:00.0, compute capability: 8.6
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 35s 0us/step
2023-12-29 17:49:51.160147: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/25
2023-12-29 17:49:56.694969: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-29 17:50:00.947534: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fb97425ce20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-29 17:50:00.947561: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX A4500, Compute Capability 8.6
2023-12-29 17:50:00.950948: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1703872200.993064     103 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
13/13 [==============================] - 41s 808ms/step - loss: 5.4299 - accuracy: 0.0278
Epoch 2/25
13/13 [==============================] - 4s 285ms/step - loss: 4.2069 - accuracy: 0.0712
Epoch 3/25
13/13 [==============================] - 4s 285ms/step - loss: 3.8504 - accuracy: 0.1172
Epoch 4/25
13/13 [==============================] - 4s 285ms/step - loss: 3.4829 - accuracy: 0.1750
Epoch 5/25
13/13 [==============================] - 4s 286ms/step - loss: 3.1631 - accuracy: 0.2380
Epoch 6/25
13/13 [==============================] - 4s 286ms/step - loss: 2.7725 - accuracy: 0.3111
Epoch 7/25
13/13 [==============================] - 4s 286ms/step - loss: 2.3888 - accuracy: 0.3901
Epoch 8/25
13/13 [==============================] - 4s 287ms/step - loss: 2.0793 - accuracy: 0.4557
Epoch 9/25
13/13 [==============================] - 4s 287ms/step - loss: 1.8113 - accuracy: 0.5219
Epoch 10/25
13/13 [==============================] - 4s 288ms/step - loss: 1.5876 - accuracy: 0.5753
Epoch 11/25
13/13 [==============================] - 4s 288ms/step - loss: 1.3336 - accuracy: 0.6312
Epoch 12/25
13/13 [==============================] - 4s 288ms/step - loss: 1.0699 - accuracy: 0.6984
Epoch 13/25
13/13 [==============================] - 4s 289ms/step - loss: 0.9236 - accuracy: 0.7364
Epoch 14/25
13/13 [==============================] - 4s 289ms/step - loss: 0.7571 - accuracy: 0.7804
Epoch 15/25
13/13 [==============================] - 4s 290ms/step - loss: 0.6041 - accuracy: 0.8242
Epoch 16/25
13/13 [==============================] - 4s 290ms/step - loss: 0.6497 - accuracy: 0.8138
Epoch 17/25
13/13 [==============================] - 4s 290ms/step - loss: 0.5552 - accuracy: 0.8316
Epoch 18/25
13/13 [==============================] - 4s 290ms/step - loss: 0.4580 - accuracy: 0.8647
Epoch 19/25
13/13 [==============================] - 4s 290ms/step - loss: 0.3844 - accuracy: 0.8903
Epoch 20/25
13/13 [==============================] - 4s 290ms/step - loss: 0.3997 - accuracy: 0.8838
Epoch 21/25
13/13 [==============================] - 4s 290ms/step - loss: 0.3681 - accuracy: 0.8954
Epoch 22/25
13/13 [==============================] - 4s 291ms/step - loss: 0.3103 - accuracy: 0.9070
Epoch 23/25
13/13 [==============================] - 4s 290ms/step - loss: 0.2674 - accuracy: 0.9209
Epoch 24/25
13/13 [==============================] - 4s 291ms/step - loss: 0.3407 - accuracy: 0.9027
Epoch 25/25
13/13 [==============================] - 4s 291ms/step - loss: 0.3117 - accuracy: 0.9118
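
The three runs use different batch sizes, so per-step times are not directly comparable. Converting them to images per second gives a rough common measure. This sketch takes the steady-state step times from the logs above and assumes all runs train the same model on the 50,000-image CIFAR-100 training split. The run labels and the function name are illustrative.

```python
def images_per_second(num_images, steps, seconds_per_step):
    """Throughput for one epoch of `steps` batches covering `num_images`."""
    return num_images / (steps * seconds_per_step)


# (steps per epoch, seconds per step) as read from the training logs.
runs = {
    "laptop CPU": (196, 0.545),
    "RTX 3500 Ada laptop GPU": (49, 0.107),
    "RTX A4500": (13, 0.290),
}

for name, (steps, sec) in runs.items():
    print(f"{name}: {images_per_second(50_000, steps, sec):,.0f} images/s")
```

The figures are only indicative: batch size affects GPU utilization, and the first epoch of each run includes compilation overhead that is excluded here.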
