ObrienlabsDev / blog

Blogs and Wiki
Apache License 2.0

TensorFlow on OSX Mac M1 Pro/Max silicon 32 cores and Windows 11 13900K with dual RTX-A4500/A4000 workstation and dual RTX-4090 consumer #13

Open obriensystems opened 1 year ago

obriensystems commented 1 year ago

RTX-4090 Ada generation consumer cards

Screenshot 2023-10-29 at 13 44 47 Screenshot 2023-10-29 at 13 46 12

RTX-A4500 Ampere generation professional workstation cards

Screenshot 2023-10-29 at 13 39 43 Screenshot 2023-10-29 at 13 41 37 Screenshot 2023-10-29 at 13 42 27 Screenshot 2023-10-29 at 13 43 30

Stats

https://github.com/ObrienlabsDev/blog/wiki/Machine-Learning-on-local-or-Cloud-based-NVidia-or-Apple-GPUs

Note: CPU utilization is 340% when training on the CPU only and 100% when training on the GPU, so that 100% is the CPU overhead of driving the GPU

Mac Mini 2020 M1

Macbook Pro 14 M1 Pro 16G 4p/4e 8 core GPU = 516ms/step CPU at 50%, 79ms/step GPU = 6.5x faster GPU

Macbook Pro 16 M1 Max 32G 8p/2e 32 core GPU = 437ms/step CPU at 50%, 54ms/step GPU = 10.5x faster GPU and 1.2x/1.5x faster than M1 Pro (49ms using 32 batch down from 64 - matching GPU size - 2.4/32G vram)
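
The ratios quoted above can be reproduced from the step times. This is a minimal sketch; the `speedup` helper is hypothetical and not part of the original test scripts. Note that 437/54 works out to about 8.1x. The 10.5x figure appears to match the M1 Pro CPU time against the 49ms/step batch-32 GPU run (516/49), but that reading is an assumption.

```python
def speedup(baseline_ms: float, faster_ms: float) -> float:
    """Return how many times faster `faster_ms` is than `baseline_ms`."""
    return baseline_ms / faster_ms


# Step times (ms/step) quoted in the stats above.
m1_pro_cpu, m1_pro_gpu = 516, 79
m1_max_cpu, m1_max_gpu, m1_max_gpu_batch32 = 437, 54, 49

print(f"M1 Pro GPU vs CPU:          {speedup(m1_pro_cpu, m1_pro_gpu):.1f}x")
print(f"M1 Max GPU vs CPU:          {speedup(m1_max_cpu, m1_max_gpu):.1f}x")
print(f"M1 Pro CPU vs 49ms GPU run: {speedup(m1_pro_cpu, m1_max_gpu_batch32):.1f}x")
print(f"M1 Max vs M1 Pro (CPU):     {speedup(m1_pro_cpu, m1_max_cpu):.1f}x")
print(f"M1 Max vs M1 Pro (GPU):     {speedup(m1_pro_gpu, m1_max_gpu):.1f}x")
```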

Lenovo P17 Gen1 128g RTX-5000 TU104 using batch of 5120 = 190us and 15.6/16G vram

Follow https://developer.apple.com/metal/tensorflow-plugin/ or, better, https://www.mrdbourke.com/setup-apple-m1-pro-and-m1-max-for-machine-learning-and-data-science/

base system M1 Pro 4p/4e 8 core GPU


./Miniforge3-MacOSX-arm64.sh
source ~/miniforge3/bin/activate
cd wse_github
cd tensor
mkdir tensorflow-test
cd tensorflow-test
conda create --prefix ./env python=3.8
conda activate ./env
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal
python -m pip install tensorflow-datasets
conda install jupyter pandas numpy matplotlib scikit-learn
jupyter notebook

other tab
vi tftest.py
 python tftest.py
 pip install numpy
 pip install pandas
 pip install sklearn
 pip install -U scikit-learn scipy matplotlib
 pip install -U tensorflow

(base) ..ien@mbp6 tensorflow-test % python tftest.py         
TensorFlow has access to the following devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
TensorFlow version: 2.14.0

The GPU is missing. The expected additional entry is:
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

try
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

(base) michaelobrien@mbp6 tensorflow-test % python gpu.py   
[]
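
A slightly more verbose check can help here, since the prompt shows `(base)` while the packages were installed into `./env`. The sketch below prints which interpreter is running and counts the devices per type. The `count_by_type` helper and the hint message are illustrative assumptions, not part of the original scripts.

```python
import sys
from collections import Counter


def count_by_type(devices):
    """Count devices per device_type, e.g. {'CPU': 1, 'GPU': 1}."""
    return dict(Counter(d.device_type for d in devices))


try:
    import tensorflow as tf
except ImportError:  # TensorFlow is not installed for this interpreter
    tf = None

print(f"Python interpreter: {sys.executable}")
if tf is None:
    print("TensorFlow is not importable from this interpreter")
else:
    counts = count_by_type(tf.config.list_physical_devices())
    print(f"TensorFlow {tf.__version__}: {counts}")
    if counts.get("GPU", 0) == 0:
        print("No GPU visible: check the tensorflow-metal / TensorFlow version pairing")
```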
obriensystems commented 1 year ago

running from https://developer.apple.com/metal/tensorflow-plugin/

only using cpu on m1 pro - checking m1-max

Screenshot 2023-09-28 at 23 11 46
import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

(base) michaelobrien@mbp6 tensorflow-test % python tflow.py 
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 6s 0us/step
Epoch 1/5
782/782 [==============================] - 412s 525ms/step - loss: 4.7090 - accuracy: 0.0739
Epoch 2/5
782/782 [==============================] - 415s 530ms/step - loss: 4.3030 - accuracy: 0.1040
Epoch 3/5
782/782 [==============================] - 414s 529ms/step - loss: 4.1217 - accuracy: 0.1116
Epoch 4/5
782/782 [==============================] - 404s 516ms/step - loss: 3.8518 - accuracy: 0.1589
Epoch 5/5
782/782 [==============================] - 403s 516ms/step - loss: 3.5352 - accuracy: 0.1952
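
The 782 steps per epoch in the log follow from the CIFAR-100 training split (50000 samples) and the batch size of 64, with the last batch being partial. A minimal sketch of that arithmetic (the `steps_per_epoch` helper is hypothetical) also shows the step counts for the larger batch sizes used later in this thread:

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """One step per batch; a final partial batch still counts as a step."""
    return math.ceil(num_samples / batch_size)


CIFAR100_TRAIN_SAMPLES = 50_000  # size of the CIFAR-100 training split

for batch_size in (64, 128, 1024, 4096):
    steps = steps_per_epoch(CIFAR100_TRAIN_SAMPLES, batch_size)
    print(f"batch_size={batch_size:>4}: {steps} steps/epoch")
```

At roughly 525ms/step, 782 steps come to about 410s, in line with the 403s to 415s epochs reported above.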
obriensystems commented 1 year ago

try https://github.com/apple/tensorflow_macos to get the gpu working on m1 mac

(/Users/michaelobrien/wse_github/tensor/tensorflow-test/env) michaelobrien@mbp6 tensorflow-test % /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/apple/tensorflow_macos/master/scripts/download_and_install.sh)" 

ERROR: TensorFlow with ML Compute acceleration is only available on macOS 11.0 and later.

I have macOS 13.5.

Actually, that repo is 2 years old and links back to https://developer.apple.com/metal/tensorflow-plugin/

(/Users/michaelobrien/wse_github/tensor/tensorflow-test/env) michaelobrien@mbp6 tensorflow-test % python -m pip install tensorflow-metal
Requirement already satisfied: tensorflow-metal in ./env/lib/python3.8/site-packages (1.0.1)
Requirement already satisfied: wheel~=0.35 in ./env/lib/python3.8/site-packages (from tensorflow-metal) (0.41.2)
Requirement already satisfied: six>=1.15.0 in ./env/lib/python3.8/site-packages (from tensorflow-metal) (1.16.0)
obriensystems commented 1 year ago

try again

(base) michaelobrien@mbp6 tensorflow-test % pip install tensorflow-metal
Collecting tensorflow-metal
  Obtaining dependency information for tensorflow-metal from https://files.pythonhosted.org/packages/f3/3d/0796dda099a84e166aacb493f8a161c8816175e514e79012b940364787d4/tensorflow_metal-1.0.1-cp310-cp310-macosx_12_0_arm64.whl.metadata
  Downloading tensorflow_metal-1.0.1-cp310-cp310-macosx_12_0_arm64.whl.metadata (1.2 kB)
Requirement already satisfied: wheel~=0.35 in /Users/michaelobrien/miniforge3/lib/python3.10/site-packages (from tensorflow-metal) (0.41.2)
Requirement already satisfied: six>=1.15.0 in /Users/michaelobrien/miniforge3/lib/python3.10/site-packages (from tensorflow-metal) (1.16.0)
Downloading tensorflow_metal-1.0.1-cp310-cp310-macosx_12_0_arm64.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 13.9 MB/s eta 0:00:00
Installing collected packages: tensorflow-metal
Successfully installed tensorflow-metal-1.0.1
(base) michaelobrien@mbp6 tensorflow-test % python gpu.py               
Traceback (most recent call last):
  File "/Users/michaelobrien/wse_github/tensor/tensorflow-test/gpu.py", line 1, in <module>
    import tensorflow as tf
  File "/Users/michaelobrien/miniforge3/lib/python3.10/site-packages/tensorflow/__init__.py", line 445, in <module>
    _ll.load_library(_plugin_dir)
  File "/Users/michaelobrien/miniforge3/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 151, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/michaelobrien/miniforge3/lib/python3.10/site-packages/tensorflow-plugins/libmetal_plugin.dylib, 0x0006): Symbol not found: __ZN10tensorflow16TensorShapeProtoC1ERKS0_
  Referenced from: <10B7FC95-0B10-3E4E-84D0-79A2D52E4D78> /Users/michaelobrien/miniforge3/lib/python3.10/site-packages/tensorflow-plugins/libmetal_plugin.dylib
  Expected in:     <C104091C-297A-300E-A02F-509BCA2330E3> /Users/michaelobrien/miniforge3/lib/python3.10/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

Retry using https://developer.apple.com/forums/thread/689300

Better: both the CPU and the GPU are now visible.

Screenshot 2023-09-29 at 00 08 52
Reverted from TensorFlow 2.14 to 2.9:
conda install -c apple tensorflow-deps==2.9.0
python -m pip install tensorflow-macos==2.9.0
python -m pip install tensorflow-metal==0.5.0
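
The earlier `Symbol not found` error looks consistent with a tensorflow-metal plugin built against a different TensorFlow release than the one installed, so it is worth listing the versions the active interpreter actually sees. A minimal sketch using only the standard library (Python 3.8+); the `installed_versions` helper is hypothetical:

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


packages = ["tensorflow", "tensorflow-macos", "tensorflow-metal"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Run it once in `(base)` and once in `./env`: the logs above show the two environments using different Python versions (3.10 and 3.8).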

The GPU is 6.5x faster than the CPU, even on the lowest GPU in an M1 Pro
2023-09-29 00:07:57.790054: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-29 00:07:57.790208: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-29 00:07:59.290834: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
2023-09-29 00:08:01.285460: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
782/782 [==============================] - 75s 84ms/step - loss: 4.9904 - accuracy: 0.0426
Epoch 2/5
782/782 [==============================] - 62s 80ms/step - loss: 4.5035 - accuracy: 0.0736
Epoch 3/5
782/782 [==============================] - 63s 81ms/step - loss: 4.0429 - accuracy: 0.1108
Epoch 4/5
782/782 [==============================] - 62s 80ms/step - loss: 3.8094 - accuracy: 0.1470
Epoch 5/5
782/782 [==============================] - 62s 79ms/step - loss: 3.7574 - accuracy: 0.1545
(base) michaelobrien@mbp6 tensorflow-test % python gpu.py                                
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

(base) michaelobrien@mbp6 tensorflow-test % python tftest.py 
TensorFlow has access to the following devices:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
TensorFlow version: 2.9.0
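
To compare runs without reading the numbers off by eye, the completed-epoch lines can be parsed for their ms/step values. The sketch below uses the CPU and GPU logs from the M1 Pro runs above. The regular expression and the `step_times_ms` helper are assumptions based on the Keras progress-bar format shown in this thread.

```python
import re
from statistics import mean

# Matches completed-epoch lines such as
# "782/782 [=====...=====] - 412s 525ms/step - loss: ..."
# In-progress lines (e.g. "557/782 [====>....] - ETA: 17s") do not match.
EPOCH_LINE = re.compile(r"^(\d+)/\1 \[=+\] - \d+s (\d+)ms/step")


def step_times_ms(log_text):
    """Return the ms/step value of every completed epoch in a Keras log."""
    times = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            times.append(int(match.group(2)))
    return times


CPU_LOG = """\
782/782 [==============================] - 412s 525ms/step - loss: 4.7090 - accuracy: 0.0739
782/782 [==============================] - 415s 530ms/step - loss: 4.3030 - accuracy: 0.1040
782/782 [==============================] - 414s 529ms/step - loss: 4.1217 - accuracy: 0.1116
782/782 [==============================] - 404s 516ms/step - loss: 3.8518 - accuracy: 0.1589
782/782 [==============================] - 403s 516ms/step - loss: 3.5352 - accuracy: 0.1952
"""

GPU_LOG = """\
782/782 [==============================] - 75s 84ms/step - loss: 4.9904 - accuracy: 0.0426
782/782 [==============================] - 62s 80ms/step - loss: 4.5035 - accuracy: 0.0736
782/782 [==============================] - 63s 81ms/step - loss: 4.0429 - accuracy: 0.1108
782/782 [==============================] - 62s 80ms/step - loss: 3.8094 - accuracy: 0.1470
782/782 [==============================] - 62s 79ms/step - loss: 3.7574 - accuracy: 0.1545
"""

cpu_ms = mean(step_times_ms(CPU_LOG))
gpu_ms = mean(step_times_ms(GPU_LOG))
print(f"CPU {cpu_ms:.1f} ms/step, GPU {gpu_ms:.1f} ms/step, ratio {cpu_ms / gpu_ms:.1f}x")
```

Averaged over all five epochs this gives about 523ms/step on the CPU against about 81ms/step on the GPU, which matches the 6.5x figure.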
obriensystems commented 1 year ago

code

(base) michaelobrien@mbp6 tensorflow-test % cat tflow.py 
import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

(base) michaelobrien@mbp6 tensorflow-test % cat tftest.py 
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf
import matplotlib.pyplot as plt

# Check for TensorFlow GPU access
print(f"TensorFlow has access to the following devices:\n{tf.config.list_physical_devices()}")

# See TensorFlow version
print(f"TensorFlow version: {tf.__version__}")

(base) michaelobrien@mbp6 tensorflow-test % cat gpu.py 
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
obriensystems commented 1 year ago

Retest on Macbook Pro M1 Max 32G 8p/2e 32 core GPU

https://developer.apple.com/metal/tensorflow-plugin/

Test only CPU first

michaelobrien@mbp7 tensorflow % python3 -m venv ~/venv-metal
michaelobrien@mbp7 tensorflow % source ~/venv-metal/bin/activate
(venv-metal) michaelobrien@mbp7 tensorflow % python -m pip install -U pip
(venv-metal) michaelobrien@mbp7 tensorflow % python -m pip install tensorflow
(venv-metal) michaelobrien@mbp7 tensorflow % python tflow.py 
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 6s 0us/step
Epoch 1/5
782/782 [==============================] - 349s 443ms/step - loss: 4.5777 - accuracy: 0.0837
Epoch 2/5
782/782 [==============================] - 341s 437ms/step - loss: 4.0482 - accuracy: 0.1339
Epoch 3/5
782/782 [==============================] - 346s 442ms/step - loss: 3.8610 - accuracy: 0.1711
Epoch 4/5
782/782 [==============================] - 343s 439ms/step - loss: 3.5612 - accuracy: 0.1989
Epoch 5/5
782/782 [==============================] - 344s 440ms/step - loss: 4.1587 - accuracy: 0.1126
Screenshot 2023-09-29 at 20 40 22

Adding GPU capability

54ms/step

python -m pip install tensorflow-metal
Successfully installed tensorflow-metal-1.1.0
(venv-metal) michaelobrien@mbp7 tensorflow % python tflow.py                       
2023-09-29 22:10:24.617100: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2023-09-29 22:10:24.617127: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-09-29 22:10:24.617133: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-09-29 22:10:24.617185: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-29 22:10:24.617357: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Epoch 1/5
2023-09-29 22:10:29.517374: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
782/782 [==============================] - 58s 59ms/step - loss: 4.9364 - accuracy: 0.0567   
Epoch 2/5
782/782 [==============================] - 44s 56ms/step - loss: 4.5137 - accuracy: 0.0699
Epoch 3/5
782/782 [==============================] - 42s 54ms/step - loss: 4.0512 - accuracy: 0.1155
Epoch 4/5
782/782 [==============================] - 44s 56ms/step - loss: 3.9736 - accuracy: 0.1288
Epoch 5/5
782/782 [==============================] - 43s 56ms/step - loss: 3.7527 - accuracy: 0.1531     
Screenshot 2023-09-29 at 22 11 49
obriensystems commented 1 year ago

On the slowest Windows machine, without a GPU: Lenovo X1 Carbon Gen9, i7-1185G7 3GHz (3.5GHz), at 66% CPU

python -m venv ~/venv-tensor
source ~/venv-tensor/scripts/activate
python -m pip install -U pip
python -m pip install tensorflow
python tflow.py 

2023-09-29 22:52:34.975337: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

 73/782 [=>............................] - ETA: 15:39 - loss: 5.5396 - accuracy: 0.0289
obriensystems commented 1 year ago

On the fastest Windows laptop: Lenovo P17 Gen 1, Xeon W-10855M 2.8GHz with RTX-5000 (TU104)

https://www.tensorflow.org/install/pip#windows-native

python -m venv ~/venv-tensor
source ~/venv-tensor/scripts/activate
python -m pip install -U pip
python -m pip install tensorflow

python tflow.py

2023-09-29 23:02:40.581369: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/5

219/782 [=======>......................] - ETA: 11:23 - loss: 4.9518 - accuracy: 0.0532

GPU support in Windows WSL https://www.tensorflow.org/install/pip#windows-wsl2

python3 -m pip install tensorflow[and-cuda]
Collecting nvidia-cublas-cu11==11.11.3.6 (from tensorflow[and-cuda])
  Downloading nvidia_cublas_cu11-11.11.3.6-py3-none-win_amd64.whl (427.2 MB)
     -------------------------------------- 427.2/427.2 MB 7.7 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu11==11.8.87 (from tensorflow[and-cuda])
  Downloading nvidia_cuda_cupti_cu11-11.8.87-py3-none-win_amd64.whl (10.0 MB)
     --------------------------------------- 10.0/10.0 MB 21.9 MB/s eta 0:00:00
Collecting nvidia-cuda-nvcc-cu11==11.8.89 (from tensorflow[and-cuda])
  Downloading nvidia_cuda_nvcc_cu11-11.8.89-py3-none-win_amd64.whl (15.7 MB)
     --------------------------------------- 15.7/15.7 MB 25.1 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu11==11.8.89 (from tensorflow[and-cuda])
  Downloading nvidia_cuda_runtime_cu11-11.8.89-py3-none-win_amd64.whl (1.0 MB)
     ---------------------------------------- 1.0/1.0 MB 16.3 MB/s eta 0:00:00

Successfully installed gast-0.4.0 keras-2.13.1 numpy-1.24.3 tensorboard-2.13.0 tensorflow-2.13.1 tensorflow-estimator-2.13.0 tensorflow-intel-2.13.1 typing-extensions-4.5.0

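Not yet: no GPU is listed. The wheels above are win_amd64 builds, so this install ran on native Windows rather than inside WSL2, and TensorFlow 2.10 was the last release with GPU support on native Windows.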
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
[]
obriensystems commented 1 year ago

On Windows, DirectML works with TensorFlow 1.15 only: https://learn.microsoft.com/en-us/windows/ai/directml/gpu-tensorflow-wsl

conda create --name directml python=3.6
conda activate directml
conda init bash
conda activate directml

pip install tensorflow-directml
  WARNING: The script f2py.exe is installed in 'C:\Users\michael\AppData\Roaming\Python\Python36\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

The 4090 runs at 70% utilization - the temperature only rises 5C with the GPU at 40% power (180W of 450W)

michael@13900b MINGW64 /c/wse_github/tensorflow
$ python tflow.py
WARNING:tensorflow:From C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 50000 samples
2023-09-30 11:05:07.151873: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python/directml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.dll
2023-09-30 11:05:07.152247: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library dxgi.dll
2023-09-30 11:05:07.154477: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library d3d12.dll
2023-09-30 11:05:07.987347: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:250] DirectML device enumeration: found 2 compatible adapters.
2023-09-30 11:05:07.987646: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2023-09-30 11:05:07.989784: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 0 (NVIDIA GeForce RTX 4090)
2023-09-30 11:05:08.057829: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library Kernel32.dll
2023-09-30 11:05:08.058325: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 1 (NVIDIA GeForce RTX 4090)
Epoch 1/5
50000/50000 [==============================] - 13s 261us/sample - loss: 4.6415 - acc: 0.0717
Epoch 2/5
50000/50000 [==============================] - 11s 221us/sample - loss: 4.0632 - acc: 0.1264
Epoch 3/5
50000/50000 [==============================] - 11s 220us/sample - loss: 3.9804 - acc: 0.1411
Epoch 4/5
50000/50000 [==============================] - 11s 220us/sample - loss: 3.5434 - acc: 0.1907
Epoch 5/5
50000/50000 [==============================] - 11s 220us/sample - loss: 3.2532 - acc: 0.2384
(directml)
michael@13900b MINGW64 /c/wse_github/tensorflow
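
TensorFlow 1.15 under DirectML reports time per sample, while the Metal runs above report time per step. To compare them, multiply by the batch size. This is a rough sketch: the `ms_per_step` helper is hypothetical, and the comparison crosses TensorFlow versions and backends, so treat the ratios as indicative only.

```python
def ms_per_step(us_per_sample: float, batch_size: int) -> float:
    """Convert a per-sample time in microseconds to a per-step time in milliseconds."""
    return us_per_sample * batch_size / 1000.0


# Steady-state epochs above: 220us/sample at batch_size=64 on one RTX 4090.
rtx_4090 = ms_per_step(220, 64)

# GPU step times at batch_size=64 from the Mac runs earlier in this thread.
m1_pro_gpu_ms, m1_max_gpu_ms = 79, 54

print(f"RTX 4090 (DirectML): {rtx_4090:.2f} ms/step")
print(f"vs M1 Pro GPU: {m1_pro_gpu_ms / rtx_4090:.1f}x faster")
print(f"vs M1 Max GPU: {m1_max_gpu_ms / rtx_4090:.1f}x faster")
```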

Change the batch size from 64 to 128 and the epochs from 5 to 10 - power rises from 185W to 226W (idle 38W)

$ cat tflow.py
import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128)

Epoch 1/10
50000/50000 [==============================] - 10s 191us/sample - loss: 4.4179 - acc: 0.0768
Epoch 2/10
50000/50000 [==============================] - 7s 141us/sample - loss: 4.1419 - acc: 0.1027
Epoch 3/10
50000/50000 [==============================] - 7s 141us/sample - loss: 3.8350 - acc: 0.1345
Epoch 4/10
50000/50000 [==============================] - 7s 144us/sample - loss: 3.5075 - acc: 0.1830
Epoch 5/10
50000/50000 [==============================] - 7s 145us/sample - loss: 3.4506 - acc: 0.2024
Epoch 6/10
50000/50000 [==============================] - 7s 145us/sample - loss: 3.1387 - acc: 0.2477
Epoch 7/10
50000/50000 [==============================] - 7s 144us/sample - loss: 3.1950 - acc: 0.2530
Epoch 8/10
50000/50000 [==============================] - 7s 143us/sample - loss: 2.8823 - acc: 0.3014
Epoch 9/10
50000/50000 [==============================] - 7s 143us/sample - loss: 2.6766 - acc: 0.3380
Epoch 10/10
50000/50000 [==============================] - 7s 142us/sample - loss: 2.4959 - acc: 0.3769

Adjust the batch size to the GPU - in this case an AD102 with 16384 cores

batch = 1024

Epoch 1/10
50000/50000 [==============================] - 6s 117us/sample - loss: 4.5212 - acc: 0.0587
Epoch 2/10
50000/50000 [==============================] - 3s 59us/sample - loss: 3.6212 - acc: 0.1565
Epoch 3/10
50000/50000 [==============================] - 3s 58us/sample - loss: 3.1855 - acc: 0.2325
Epoch 4/10
50000/50000 [==============================] - 3s 58us/sample - loss: 2.8579 - acc: 0.2948
Epoch 5/10
50000/50000 [==============================] - 3s 58us/sample - loss: 2.4748 - acc: 0.3695
Epoch 6/10
50000/50000 [==============================] - 3s 58us/sample - loss: 2.1747 - acc: 0.4333
Epoch 7/10
50000/50000 [==============================] - 3s 58us/sample - loss: 2.0401 - acc: 0.4645
Epoch 8/10
50000/50000 [==============================] - 3s 58us/sample - loss: 2.3375 - acc: 0.3928
Epoch 9/10
50000/50000 [==============================] - 3s 58us/sample - loss: 1.7525 - acc: 0.5260
Epoch 10/10

350 watts

The maximum, with 400 watt peaks, is a batch size of 4096

import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=4096)

Epoch 1/10
50000/50000 [==============================] - 5s 100us/sample - loss: 5.3493 - acc: 0.0235
Epoch 2/10
50000/50000 [==============================] - 3s 52us/sample - loss: 4.2236 - acc: 0.0655
Epoch 3/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.8262 - acc: 0.1216
Epoch 4/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.4908 - acc: 0.1787
Epoch 5/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.1391 - acc: 0.2404
Epoch 6/10
50000/50000 [==============================] - 3s 52us/sample - loss: 2.9317 - acc: 0.2797
Epoch 7/10
50000/50000 [==============================] - 3s 51us/sample - loss: 2.7100 - acc: 0.3233
Epoch 8/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.1979 - acc: 0.2347
Epoch 9/10
50000/50000 [==============================] - 3s 51us/sample - loss: 2.9408 - acc: 0.2769
Epoch 10/10
50000/50000 [==============================] - 3s 51us/sample - loss: 2.5449 - acc: 0.3592
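
Converting the per-sample times to throughput makes the diminishing returns of larger batches easier to see. The values below are the steady-state figures read from the DirectML logs above (the batch-128 value is an approximate average of the 141us to 145us epochs). The `samples_per_second` helper is hypothetical.

```python
# Steady-state us/sample read from the DirectML logs above, keyed by batch size.
US_PER_SAMPLE = {64: 220, 128: 143, 1024: 58, 4096: 51}


def samples_per_second(us_per_sample: float) -> float:
    """Convert microseconds per sample to samples per second."""
    return 1_000_000 / us_per_sample


baseline = samples_per_second(US_PER_SAMPLE[64])
for batch_size, us in US_PER_SAMPLE.items():
    rate = samples_per_second(us)
    print(f"batch {batch_size:>4}: {rate:>6.0f} samples/s ({rate / baseline:.2f}x vs batch 64)")
```

Going from 64 to 1024 almost quadruples the throughput, while going from 1024 to 4096 adds only about 14% more.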
obriensystems commented 1 year ago

Force an OOM at 23 of 24G by using a batch size of 10240 on the 16384-core GPU

model.fit(x_train, y_train, epochs=100, batch_size=10240)#7168)

2023-10-01 08:23:08.889749: I tensorflow/core/common_runtime/bfc_allocator.cc:943] Sum Total of in-use chunks: 21.66GiB
2023-10-01 08:23:08.889778: I tensorflow/core/common_runtime/bfc_allocator.cc:945] total_region_allocated_bytes_: 23297890560 memory_limit_: 23297890714 available bytes: 154 curr_region_allocation_bytes_: 4293918720
2023-10-01 08:23:08.889821: I tensorflow/core/common_runtime/bfc_allocator.cc:951] Stats:
Limit:                 23297890714
InUse:                 23255949312
MaxInUse:              23256997888
NumAllocs:                    1532
MaxAllocSize:            849346560

2023-10-01 08:23:08.889962: W tensorflow/core/common_runtime/bfc_allocator.cc:446] ****************************************************************************************************
2023-10-01 08:23:08.890007: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at dml_kernel_context.cc:132 : Resource exhausted: OOM when allocating tensor with shape[10240,1024,2,2] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
Traceback (most recent call last):
  File "tflow.py", line 65, in <module>
    model.fit(x_train, y_train, epochs=100, batch_size=10240)#7168)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 675, in fit
    steps_name='steps_per_epoch')
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 394, in model_iteration
    batch_outs = f(ins_batch)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\client\session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[10240,1024,2,2] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[{{node conv4_block3_3_bn/cond/FusedBatchNormV3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[loss/mul/_1553]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[10240,1024,2,2] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[{{node conv4_block3_3_bn/cond/FusedBatchNormV3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
(directml)
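
The size of the allocation that failed can be worked out from the shape in the error message, assuming `float` here means 4-byte float32. A minimal sketch of that arithmetic, using the `Limit` and `InUse` values from the allocator stats above (the `tensor_bytes` helper is hypothetical):

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; float32 uses 4 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


MIB = 1024 ** 2
GIB = 1024 ** 3

# Shape from the OOM message: [batch, channels, height, width].
failed = tensor_bytes([10240, 1024, 2, 2])

# Limit and InUse from the bfc_allocator stats in the log.
limit, in_use = 23297890714, 23255949312

print(f"failed allocation: {failed / MIB:.0f} MiB")
print(f"free at failure:   {(limit - in_use) / MIB:.0f} MiB of {limit / GIB:.1f} GiB")
```

The single tensor needs 160 MiB while only about 40 MiB of the allocator's limit was still free, so the allocation could not succeed at this batch size.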

There are also issues using multiple GPUs on DirectML: https://github.com/microsoft/tensorflow-directml/issues/352 https://learn.microsoft.com/en-us/windows/ai/directml/gpu-faq

import tensorflow as tf

# https://www.tensorflow.org/guide/distributed_training
#
# https://www.tensorflow.org/tutorials/distribute/keras
#strategy = tf.distribute.MirroredStrategy()
#print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

#NUM_GPUS = 2
#strategy = tf.contrib.distribute.MirroredStrategy()#num_gpus=NUM_GPUS)
# not working
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
#WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /replica:0/task:0/device:GPU:1,/replica:0/task:0/device:GPU:0
#Number of devices: 2

#central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
#strategy = tf.distribute.MultiWorkerMirroredStrategy() # not in tf 1.15
#print("mirrored_strategy: ",mirrored_strategy)
#strategy = tf.distribute.OneDeviceStrategy(device="/gpu:1")
#mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"],cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(all_reduce_alg="hierarchical_copy"))
#mirrored_strategy = tf.distribute.MirroredStrategy(devices= ["/gpu:0","/gpu:1"],cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

#with strategy.scope():
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=100, batch_size=7168)
obriensystems commented 1 year ago

Multi GPU specific to DirectML: https://learn.microsoft.com/en-us/windows/ai/directml/gpu-faq

or

# https://learn.microsoft.com/en-us/windows/ai/directml/gpu-faq
# TF 1.15 session API: restrict the session to one adapter by index
import tensorflow as tf

a = tf.constant([1.])
b = tf.constant([2.])
c = tf.add(a, b)

gpu_config = tf.GPUOptions()
gpu_config.visible_device_list = "1"  # or "0" for the first adapter

session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_config))
print(session.run(c))

https://github.com/tensorflow/tensorflow/issues/19083 https://github.com/tensorflow/tensorflow/issues/18861#issuecomment-388454669

michael@13900b MINGW64 /c/wse_github/tensorflow
$ python tflow.py
WARNING:tensorflow:From tflow.py:32: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From tflow.py:36: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From tflow.py:36: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2023-09-30 22:42:26.532728: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2023-09-30 22:42:26.535407: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python/directml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.dll
2023-09-30 22:42:26.535750: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library dxgi.dll
2023-09-30 22:42:26.537833: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library d3d12.dll
2023-09-30 22:42:27.425768: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:250] DirectML device enumeration: found 2 compatible adapters.
2023-09-30 22:42:27.425872: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 1 (NVIDIA GeForce RTX 4090)
2023-09-30 22:42:27.490844: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library Kernel32.dll
[3.]
WARNING:tensorflow:From C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 50000 samples
2023-09-30 22:42:30.497139: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 0 (NVIDIA GeForce RTX 4090)
Traceback (most recent call last):
  File "tflow.py", line 53, in <module>
    model.fit(x_train, y_train, epochs=100, batch_size=7168)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 675, in fit
    steps_name='steps_per_epoch')
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 271, in model_iteration
    model.reset_metrics()
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\engine\training.py", line 914, in reset_metrics
    m.reset_states()
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\metrics.py", line 210, in reset_states
    K.batch_set_value([(v, 0) for v in self.variables])
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\backend.py", line 3259, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\backend.py", line 483, in get_session
    session = _get_session(op_input_list)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\keras\backend.py", line 455, in _get_session
    config=get_default_session_config())
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\client\session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "C:\Users\michael\AppData\Roaming\Python\Python36\site-packages\tensorflow_core\python\client\session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (DML:0) is being mapped to multiple DML devices (0 now, and 1 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not  currently supported, see https://github.com/tensorflow/tensorflow/issues/19083
(directml)

check https://keras.io/guides/distributed_training/
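The AlreadyExistsError above says DML:0 was mapped to adapter 1 first and adapter 0 later in the same process. One thing to try is a sketch like the one below, which pins the visible adapters before TensorFlow is imported so every session sees the same mapping. It assumes the `DML_VISIBLE_DEVICES` environment variable is honoured by the installed tensorflow-directml build (verify this against the package docs); `pin_directml_adapters` is a hypothetical helper name, and this has not been confirmed to fix the error on the dual RTX 4090 machine.

```python
import os


def pin_directml_adapters(adapter_indices):
    """Restrict DirectML to the given adapter indices (assumed env var name)."""
    value = ",".join(str(index) for index in adapter_indices)
    # Assumption: tensorflow-directml reads this variable at import time.
    os.environ["DML_VISIBLE_DEVICES"] = value
    return value


# Expose only adapter 0, then import TensorFlow afterwards.
visible = pin_directml_adapters([0])
# import tensorflow as tf  # deliberately left out so the sketch runs without TF
```

The variable has to be set before `import tensorflow`, otherwise the device enumeration has already happened.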

obriensystems commented 1 year ago

move distribution to #15

obriensystems commented 1 year ago

on Lenovo P17 Gen 1 128g RTX-5000 (TU-104)

conda create --name directml python=3.6
conda init bash
cd /c/_dev/tensorflow/
conda activate directml
pip install tensorflow-directml
python tflow.py

model.fit(x_train, y_train, epochs=5, batch_size=64)

micha@LAPTOP-M4VQDR8K MINGW64 /c/_dev/tensorflow
$ python tflow.py
WARNING:tensorflow:From C:\Users\micha\.conda\envs\directml\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2023-10-01 09:02:10.098122: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library C:\Users\micha\.conda\envs\directml\lib\site-packages\tensorflow_core\python/directml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.dll
2023-10-01 09:02:10.098534: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library dxgi.dll
2023-10-01 09:02:10.102097: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library d3d12.dll
2023-10-01 09:02:10.407508: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:250] DirectML device enumeration: found 1 compatible adapters.
2023-10-01 09:02:10.407769: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2023-10-01 09:02:10.410201: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 0 (NVIDIA Quadro RTX 5000)
2023-10-01 09:02:10.515652: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library Kernel32.dll
Train on 50000 samples
Epoch 1/5

   64/50000 [..............................] - ETA: 44:34 - loss: 6.1871 - acc: 0.0156
  128/50000 [..............................] - ETA: 22:58 - loss: 6.8958 - acc: 0.0156
  256/50000 [..............................] - ETA: 11:46 - loss: 7.2895 - acc: 0.0078
  384/50000 [..............................] - ETA: 7:58 - loss: 7.0193 - acc: 0.0130

2.4/16G vram
50000/50000 [==============================] - 35s 707us/sample - loss: 4.6432 - acc: 0.0621

5.6/16G vram
model.fit(x_train, y_train, epochs=10, batch_size=1024)
50000/50000 [==============================] - 16s 322us/sample - loss: 4.5699 - acc: 0.0504

15.6/16G vram
model.fit(x_train, y_train, epochs=10, batch_size=7168)
runs out of memory (OOM) on the 16G RTX-5000; this batch size only fits a 24G RTX-4090
2023-10-01 09:05:51.909980: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at dml_kernel_wrapper.cc:188 : Resource exhausted: OOM when allocating a buffer of 1048576 bytes

2023-10-01 09:07:31.672303: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at dml_kernel_context.cc:135 : Resource exhausted: OOM when allocating tensor with shape[6144,1024,2,2] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
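For scale, the size of the tensor named in the second OOM message can be worked out from its shape. This is a small sketch, assuming "type float" means 4-byte float32; `tensor_bytes` is a hypothetical helper, not a TensorFlow call.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory footprint of a dense tensor in bytes."""
    return prod(shape) * bytes_per_element


# Shape taken from the OOM message above; float32 is assumed (4 bytes).
size = tensor_bytes((6144, 1024, 2, 2))
print(size, "bytes =", size / 2**20, "MiB")
```

So a single activation of that shape is 96 MiB, and the first message failed on a buffer of only 1 MiB (1048576 bytes), which suggests the card was already essentially full when the allocation was attempted.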

13.4/16G
model.fit(x_train, y_train, epochs=10, batch_size=4096)

50000/50000 [==============================] - 9s 185us/sample - loss: 3.4452 - acc: 0.1854

15.6/16G
batch 5120
50000/50000 [==============================] - 9s 183us/sample - loss: 4.2929 - acc: 0.0615
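To compare these runs with logs that report ms/step instead of us/sample, both can be converted to samples per second. A minimal sketch with hypothetical helper names; the inputs are the rounded figures Keras prints, so the results are approximate.

```python
def samples_per_second_from_us(us_per_sample):
    """Throughput from the 'us/sample' figure in TF 1.x style logs."""
    return 1_000_000 / us_per_sample


def samples_per_second_from_step(ms_per_step, batch_size):
    """Throughput from the 'ms/step' figure, given the batch size used."""
    return batch_size / (ms_per_step / 1000)


# RTX-5000 figures from the runs above: batch 64, 1024 and 4096.
for us in (707, 322, 185):
    print(us, "us/sample ->", round(samples_per_second_from_us(us)), "samples/s")
```

The same conversion applies to a Metal log line such as 51ms/step at batch 32 through `samples_per_second_from_step(51, 32)`.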

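import tensorflow as tf

# CIFAR-100 training split: 50000 images of 32x32x3 with integer labels 0..99
(x_train, y_train), _ = tf.keras.datasets.cifar100.load_data()

# ResNet50 trained from scratch (no pretrained weights), 100 output classes
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

# labels are integers and the top layer already applies softmax, so from_logits=False
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=5120)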
obriensystems commented 1 year ago

look at steps_per_epoch in Keras model.fit to align the M1 and RTX runs (the Metal log reports ms/step, the DirectML log reports us/sample)
https://stackoverflow.com/questions/43457862/whats-the-difference-between-samples-per-epoch-and-steps-per-epoch-in-fit-g
https://saturncloud.io/blog/understanding-the-differences-between-keras-modelfitgenerator-and-modelfit/
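When steps_per_epoch is not set, the step count Keras shows for array input is the sample count divided by the batch size, rounded up. A small sketch of that arithmetic for the 50000-sample training set used here; `steps_per_epoch` below is a hypothetical helper, not the Keras argument itself.

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of batches in one full pass; the last batch may be partial."""
    return (num_samples + batch_size - 1) // batch_size


for batch_size in (32, 64, 5120):
    print(batch_size, "->", steps_per_epoch(50000, batch_size), "steps")
```

Batch 32 gives the 1563 steps seen in the M1 Max log in this thread, and batch 5120 gives 10 steps, which is consistent with the 10/10 progress bars in the Docker run.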

michaelobrien@mbp7 tensorflow % python tflow.py
2023-10-01 09:32:07.121113: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2023-10-01 09:32:07.121141: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-10-01 09:32:07.121146: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-10-01 09:32:07.121183: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-01 09:32:07.121197: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Epoch 1/10
2023-10-01 09:32:11.030326: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
250/250 [==============================] - 123s 475ms/step - loss: 3.5326 - accuracy: 0.1946
Epoch 2/10
230/250 [==========================>...] - ETA: 9s - loss: 2.8680 - accuracy: 0.3047
2023-10-01 09:35:59.616158: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10758936884266613144
2023-10-01 09:35:59.616172: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7954835218486006552
2023-10-01 09:35:59.616175: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 5170976072988965275
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 2500 batches). You may need to use the repeat() function when building your dataset.
250/250 [==============================] - 109s 435ms/step - loss: 2.8680 - accuracy: 0.3047
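The warning above means the input could not supply steps_per_epoch * epochs = 2500 batches. Below is a plain-Python analogue of the repeat() idea the warning suggests: a generator that starts over after each pass so it never runs dry. It is only an illustration, not tf.data itself, and the batch size of 200 is a hypothetical value because the batch size of this run is not shown.

```python
from itertools import islice


def repeating_batches(num_samples, batch_size):
    """Yield (start, stop) index ranges forever, restarting after each pass."""
    while True:
        for start in range(0, num_samples, batch_size):
            yield start, min(start + batch_size, num_samples)


# Hypothetical: 50000 samples at batch size 200 give 250 batches per pass.
needed = 250 * 10  # steps_per_epoch * epochs, the 2500 batches in the warning
batches = list(islice(repeating_batches(50000, 200), needed))
print(len(batches))
```

Without the outer `while True` the generator would stop after one pass, which is the behaviour the warning is complaining about.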
obriensystems commented 1 year ago

Training times to get past 0.7 accuracy

Macbook Pro M1max 32 core

model.fit(x_train, y_train, epochs=10, batch_size=32)

2023-10-01 09:40:02.430113: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
1563/1563 [==============================] - 85s 51ms/step - loss: 5.0784 - accuracy: 0.0506   
Epoch 2/10
1563/1563 [==============================] - 81s 52ms/step - loss: 4.3543 - accuracy: 0.0882
Epoch 3/10
1563/1563 [==============================] - 79s 50ms/step - loss: 3.8541 - accuracy: 0.1380
Epoch 4/10
1563/1563 [==============================] - 79s 50ms/step - loss: 3.9430 - accuracy: 0.1335
Epoch 5/10
1563/1563 [==============================] - 80s 51ms/step - loss: 3.7356 - accuracy: 0.1476
Epoch 6/10
1563/1563 [==============================] - 78s 50ms/step - loss: 3.5188 - accuracy: 0.1733
Epoch 7/10
1563/1563 [==============================] - 77s 50ms/step - loss: 3.3642 - accuracy: 0.1959
Epoch 8/10
1563/1563 [==============================] - 78s 50ms/step - loss: 3.1934 - accuracy: 0.2211
Epoch 9/10
1563/1563 [==============================] - 79s 50ms/step - loss: 3.1185 - accuracy: 0.2339
Epoch 10/10
1563/1563 [==============================] - 80s 51ms/step - loss: 3.0404 - accuracy: 0.2505

custom RTX-A4500 GA-102 16g on 13900k

custom RTX-4090 AD102 24g on 13900k

parallel_model.fit(x_train, y_train, epochs=10, batch_size=7168)
Epoch 1/10
50000/50000 [==============================] - 5s 105us/sample - loss: 5.5981 - acc: 0.0194
Epoch 2/10
50000/50000 [==============================] - 3s 51us/sample - loss: 4.4673 - acc: 0.0505
Epoch 3/10
50000/50000 [==============================] - 3s 51us/sample - loss: 4.1201 - acc: 0.0755
Epoch 4/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.8704 - acc: 0.1137
Epoch 5/10
50000/50000 [==============================] - 3s 52us/sample - loss: 3.6233 - acc: 0.1542
Epoch 6/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.3625 - acc: 0.1976
Epoch 7/10
50000/50000 [==============================] - 3s 51us/sample - loss: 3.0539 - acc: 0.2537
Epoch 8/10
50000/50000 [==============================] - 3s 51us/sample - loss: 2.7113 - acc: 0.3240
Epoch 9/10
50000/50000 [==============================] - 3s 51us/sample - loss: 2.2822 - acc: 0.4146
Epoch 10/10
50000/50000 [==============================] - 3s 51us/sample - loss: 1.8936 - acc: 0.5037
(directml)

batch_size=5120
Epoch 1/10
50000/50000 [==============================] - 5s 103us/sample - loss: 5.4113 - acc: 0.0200
Epoch 2/10
50000/50000 [==============================] - 3s 52us/sample - loss: 4.3464 - acc: 0.0573
Epoch 3/10
50000/50000 [==============================] - 3s 52us/sample - loss: 3.9781 - acc: 0.0958
Epoch 4/10
50000/50000 [==============================] - 3s 52us/sample - loss: 3.6798 - acc: 0.1425
Epoch 5/10
50000/50000 [==============================] - 3s 52us/sample - loss: 3.3581 - acc: 0.1974
Epoch 6/10
50000/50000 [==============================] - 3s 52us/sample - loss: 2.9757 - acc: 0.2714
Epoch 7/10
50000/50000 [==============================] - 3s 52us/sample - loss: 2.5354 - acc: 0.3621
Epoch 8/10
50000/50000 [==============================] - 3s 52us/sample - loss: 2.0884 - acc: 0.4622
Epoch 9/10
50000/50000 [==============================] - 3s 52us/sample - loss: 1.6150 - acc: 0.5746
Epoch 10/10
50000/50000 [==============================] - 3s 52us/sample - loss: 1.2165 - acc: 0.6736

Epoch 1/20
50000/50000 [==============================] - 5s 104us/sample - loss: 5.3118 - acc: 0.0225
Epoch 2/20
50000/50000 [==============================] - 3s 52us/sample - loss: 4.3652 - acc: 0.0505
Epoch 3/20
50000/50000 [==============================] - 3s 52us/sample - loss: 4.0122 - acc: 0.0931
Epoch 4/20
50000/50000 [==============================] - 3s 52us/sample - loss: 3.6943 - acc: 0.1422
Epoch 5/20
50000/50000 [==============================] - 3s 52us/sample - loss: 3.3622 - acc: 0.1981
Epoch 6/20
50000/50000 [==============================] - 3s 52us/sample - loss: 2.9669 - acc: 0.2736
Epoch 7/20
50000/50000 [==============================] - 3s 52us/sample - loss: 2.5067 - acc: 0.3677
Epoch 8/20
50000/50000 [==============================] - 3s 52us/sample - loss: 2.0413 - acc: 0.4755
Epoch 9/20
50000/50000 [==============================] - 3s 52us/sample - loss: 1.6177 - acc: 0.5725
Epoch 10/20
50000/50000 [==============================] - 3s 52us/sample - loss: 1.2309 - acc: 0.6696
Epoch 11/20
50000/50000 [==============================] - 3s 52us/sample - loss: 0.8803 - acc: 0.7602
Epoch 12/20
50000/50000 [==============================] - 3s 52us/sample - loss: 0.6237 - acc: 0.8299
Epoch 13/20
50000/50000 [==============================] - 3s 52us/sample - loss: 0.4394 - acc: 0.8779
Epoch 14/20
50000/50000 [==============================] - 3s 51us/sample - loss: 0.3560 - acc: 0.8989
Epoch 15/20
50000/50000 [==============================] - 3s 51us/sample - loss: 0.2832 - acc: 0.9201
Epoch 16/20
50000/50000 [==============================] - 3s 51us/sample - loss: 0.2414 - acc: 0.9318
Epoch 17/20
50000/50000 [==============================] - 3s 52us/sample - loss: 0.2158 - acc: 0.9383
Epoch 18/20
50000/50000 [==============================] - 3s 51us/sample - loss: 0.1925 - acc: 0.9452
Epoch 19/20
50000/50000 [==============================] - 3s 52us/sample - loss: 0.1823 - acc: 0.9474
Epoch 20/20
50000/50000 [==============================] - 3s 51us/sample - loss: 0.1926 - acc: 0.9432
(directml)

Lenovo P17 gen 1 RTX-5000 TU-104

model.fit(x_train, y_train, epochs=10, batch_size=5120)

Epoch 1/10
50000/50000 [==============================] - 13s 261us/sample - loss: 5.3176 - acc: 0.0190
Epoch 2/10
50000/50000 [==============================] - 9s 182us/sample - loss: 4.3495 - acc: 0.0523
Epoch 3/10
50000/50000 [==============================] - 9s 183us/sample - loss: 3.9718 - acc: 0.1005
Epoch 4/10
50000/50000 [==============================] - 9s 184us/sample - loss: 3.6286 - acc: 0.1544
Epoch 5/10
50000/50000 [==============================] - 9s 185us/sample - loss: 3.2741 - acc: 0.2139
Epoch 6/10
50000/50000 [==============================] - 9s 185us/sample - loss: 2.8738 - acc: 0.2921
Epoch 7/10
50000/50000 [==============================] - 9s 185us/sample - loss: 2.4140 - acc: 0.3881
Epoch 8/10
50000/50000 [==============================] - 9s 186us/sample - loss: 1.9310 - acc: 0.5000
Epoch 9/10
50000/50000 [==============================] - 9s 187us/sample - loss: 1.5011 - acc: 0.6009
Epoch 10/10
50000/50000 [==============================] - 9s 187us/sample - loss: 1.1043 - acc: 0.7008

40 epochs
50000/50000 [==============================] - 10s 196us/sample - loss: 0.3497 - acc: 0.9003
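To read "time to get past 0.7" straight off logs like the ones above, the per-epoch seconds can be summed until the accuracy crosses the threshold. A rough sketch, assuming the line formats shown in this thread ("acc:" or "accuracy:" at the end, whole seconds after the progress bar); `time_to_accuracy` is a hypothetical helper and the result is only as precise as the rounded seconds Keras prints.

```python
import re

# Matches finished-epoch lines such as
# "50000/50000 [====] - 9s 187us/sample - loss: 1.1043 - acc: 0.7008"
EPOCH_LINE = re.compile(r"- (\d+)s .* - acc(?:uracy)?: ([0-9.]+)")


def time_to_accuracy(log_lines, threshold=0.7):
    """Return (epoch, cumulative_seconds) for the first epoch at or above threshold."""
    elapsed = 0
    epoch = 0
    for line in log_lines:
        match = EPOCH_LINE.search(line)
        if match is None:
            continue  # skip "Epoch n/m" headers, ETA lines and TF log noise
        epoch += 1
        elapsed += int(match.group(1))
        if float(match.group(2)) >= threshold:
            return epoch, elapsed
    return None  # threshold never reached in this log


# Last three epochs of the RTX-5000 run above, used as a short sample.
sample = [
    "Epoch 8/10",
    "50000/50000 [==============================] - 9s 186us/sample - loss: 1.9310 - acc: 0.5000",
    "Epoch 9/10",
    "50000/50000 [==============================] - 9s 187us/sample - loss: 1.5011 - acc: 0.6009",
    "Epoch 10/10",
    "50000/50000 [==============================] - 9s 187us/sample - loss: 1.1043 - acc: 0.7008",
]
print(time_to_accuracy(sample))
```

Summing the full logs by hand gives roughly 13 + 9 x 9 = 94s for the RTX-5000 (epoch 10, acc 0.7008) and roughly 5 + 10 x 3 = 35s for the RTX-4090 DirectML run at batch 5120 (epoch 11, acc 0.7602). The M1 Max run at batch 32 did not reach 0.7 within its 10 epochs.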
obriensystems commented 1 year ago

tensorflow gpu on docker for windows

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-10-01 20:51:46.151418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-01 20:51:46.151427: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 20:51:46.151429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 1, defaulting to 0.  Your kernel may not have been built with NUMA support.
Your kernel may have been built without NUMA support.
2023-10-01 20:51:46.151467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21286 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-10-01 20:51:46.151853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21286 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:02:00.0, compute capability: 8.9

michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/env/windows (main)
$ docker build -t ml-tensorflow .
[+] Building 0.2s (8/8) FINISHED                                                                                                                                                                    docker:default
 => [internal] load .dockerignore                                                                                                                                                                             0.0s
 => => transferring context: 2B                                                                                                                                                                               0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                          0.0s
 => => transferring dockerfile: 277B                                                                                                                                                                          0.0s
 => [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu                                                                                                                                   0.0s
 => [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu                                                                                                                                                     0.0s
 => [internal] load build context                                                                                                                                                                             0.0s
 => => transferring context: 3.27kB                                                                                                                                                                           0.0s
 => CACHED [2/3] WORKDIR /src                                                                                                                                                                                 0.0s
 => [3/3] COPY /src/tflow.py .                                                                                                                                                                                0.0s
 => exporting to image                                                                                                                                                                                        0.0s
 => => exporting layers                                                                                                                                                                                       0.0s
 => => writing image sha256:fb479e6dbe44c021640f8fe7b02d448a979617842885e7422ec38613697b5fd2                                                                                                                  0.0s
 => => naming to docker.io/library/ml-tensorflow                                                                                                                                                              0.0s

What's Next?
  View summary of image vulnerabilities and recommendations → docker scout quickview
(directml)
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/env/windows (main)
$ docker run --rm --gpus all ml-tensorflow
2023-10-01 21:12:33.273007: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-01 21:12:33.291818: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-01 21:12:33.291860: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-01 21:12:33.291874: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-01 21:12:33.295574: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-01 21:12:34.130095: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.130142: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.132513: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.132552: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.132562: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.132583: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397539: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397575: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397587: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397593: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397599: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:34.397606: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.029848: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.029931: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.029955: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.029967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-01 21:12:35.029977: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.029987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 1, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-01 21:12:35.030005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.030018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21286 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-10-01 21:12:35.030230: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-01 21:12:35.030258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21286 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:02:00.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/40
2023-10-01 21:12:47.850028: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
2023-10-01 21:12:50.096416: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4f798630 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-01 21:12:50.096439: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-10-01 21:12:50.096442: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-10-01 21:12:50.099448: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-10-01 21:12:50.151721: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
10/10 [==============================] - 28s 747ms/step - loss: 5.6548 - accuracy: 0.0189
Epoch 2/40
10/10 [==============================] - 2s 186ms/step - loss: 4.3366 - accuracy: 0.0601
Epoch 3/40
10/10 [==============================] - 2s 186ms/step - loss: 3.9851 - accuracy: 0.1074
Epoch 4/40
10/10 [==============================] - 2s 185ms/step - loss: 4.1473 - accuracy: 0.1119
Epoch 5/40
10/10 [==============================] - 2s 185ms/step - loss: 3.6866 - accuracy: 0.1488
Epoch 6/40
10/10 [==============================] - 2s 188ms/step - loss: 3.4001 - accuracy: 0.1911
Epoch 7/40
10/10 [==============================] - 2s 185ms/step - loss: 3.1273 - accuracy: 0.2419
Epoch 8/40
10/10 [==============================] - 2s 187ms/step - loss: 2.8239 - accuracy: 0.3008
Epoch 9/40
10/10 [==============================] - 2s 185ms/step - loss: 2.4314 - accuracy: 0.3818
Epoch 10/40
10/10 [==============================] - 2s 185ms/step - loss: 2.0109 - accuracy: 0.4773
Epoch 11/40
10/10 [==============================] - 2s 185ms/step - loss: 1.6005 - accuracy: 0.5769
Epoch 12/40
10/10 [==============================] - 2s 185ms/step - loss: 1.2280 - accuracy: 0.6757
Epoch 13/40
10/10 [==============================] - 2s 185ms/step - loss: 0.9023 - accuracy: 0.7565
Epoch 14/40
10/10 [==============================] - 2s 183ms/step - loss: 0.6546 - accuracy: 0.8222
Epoch 15/40
10/10 [==============================] - 2s 184ms/step - loss: 0.4760 - accuracy: 0.8716
Epoch 16/40
10/10 [==============================] - 2s 186ms/step - loss: 0.3487 - accuracy: 0.9066
Epoch 17/40
10/10 [==============================] - 2s 183ms/step - loss: 0.2590 - accuracy: 0.9307
Epoch 18/40
10/10 [==============================] - 2s 184ms/step - loss: 0.2567 - accuracy: 0.9333
Epoch 19/40
10/10 [==============================] - 2s 184ms/step - loss: 0.2593 - accuracy: 0.9320
Epoch 20/40
10/10 [==============================] - 2s 183ms/step - loss: 0.2496 - accuracy: 0.9295
Epoch 21/40
10/10 [==============================] - 2s 184ms/step - loss: 0.2063 - accuracy: 0.9398
Epoch 22/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1733 - accuracy: 0.9498
Epoch 23/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1654 - accuracy: 0.9549
Epoch 24/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1557 - accuracy: 0.9547
Epoch 25/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1539 - accuracy: 0.9574
Epoch 26/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1428 - accuracy: 0.9597
Epoch 27/40
10/10 [==============================] - 2s 185ms/step - loss: 0.1365 - accuracy: 0.9628
Epoch 28/40
10/10 [==============================] - 2s 184ms/step - loss: 0.3606 - accuracy: 0.8968
Epoch 29/40
10/10 [==============================] - 2s 185ms/step - loss: 0.3437 - accuracy: 0.9017
Epoch 30/40
10/10 [==============================] - 2s 185ms/step - loss: 0.2821 - accuracy: 0.9184
Epoch 31/40
10/10 [==============================] - 2s 185ms/step - loss: 0.1910 - accuracy: 0.9436
Epoch 32/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1642 - accuracy: 0.9535
Epoch 33/40
10/10 [==============================] - 2s 185ms/step - loss: 0.1398 - accuracy: 0.9627
Epoch 34/40
10/10 [==============================] - 2s 187ms/step - loss: 0.1175 - accuracy: 0.9736
Epoch 35/40
10/10 [==============================] - 2s 185ms/step - loss: 0.1245 - accuracy: 0.9698
Epoch 36/40
10/10 [==============================] - 2s 185ms/step - loss: 0.1172 - accuracy: 0.9709
Epoch 37/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1055 - accuracy: 0.9717
Epoch 38/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1152 - accuracy: 0.9682
Epoch 39/40
10/10 [==============================] - 2s 184ms/step - loss: 0.1087 - accuracy: 0.9724
Epoch 40/40
10/10 [==============================] - 2s 184ms/step - loss: 0.0827 - accuracy: 0.9767
(directml)
michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/env/windows (main)
$ docker run --rm --gpus all ml-tensorflow

obriensystems commented 10 months ago

Revisiting on the 13900b machine (dual RTX-4090), using the script below.

import tensorflow as tf
#import keras
#from keras.utils import multi_gpu_model
#import keras.backend as k
#https://github.com/microsoft/tensorflow-directml/issues/352

# https://www.tensorflow.org/guide/distributed_training
#
# https://www.tensorflow.org/tutorials/distribute/keras
# https://keras.io/guides/distributed_training/
#strategy = tf.distribute.MirroredStrategy()
#print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

#NUM_GPUS = 2
#strategy = tf.contrib.distribute.MirroredStrategy()#num_gpus=NUM_GPUS)
# working on dual RTX-4090
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
#WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /replica:0/task:0/device:GPU:1,/replica:0/task:0/device:GPU:0
#Number of devices: 2

#central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
#strategy = tf.distribute.MultiWorkerMirroredStrategy() # not in tf 1.15 (the version tensorflow-directml is based on)
#print("mirrored_strategy: ",mirrored_strategy)
#strategy = tf.distribute.OneDeviceStrategy(device="/gpu:1")
#mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"],cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(all_reduce_alg="hierarchical_copy"))
#mirrored_strategy = tf.distribute.MirroredStrategy(devices= ["/gpu:0","/gpu:1"],cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

#print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# https://learn.microsoft.com/en-us/windows/ai/directml/gpu-faq
#a = tf.constant([1.])
#b = tf.constant([2.])
#c = tf.add(a, b)

#gpu_config = tf.GPUOptions()
#gpu_config.visible_device_list = "1"#"0,1"
#gpu_config.visible_device_list = "0,1"
#gpu_config.allow_growth=True

#session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_config))
#print(session.run(c))
#tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (DML:0) is being mapped to multiple DML devices (0 now, and 1 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not  currently supported, see https://github.com/tensorflow/tensorflow/issues/19083
#from keras import backend as K
#K.set_session(session)

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
  # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
  # https://keras.io/api/models/model/
  parallel_model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100)
  # https://saturncloud.io/blog/how-to-do-multigpu-training-with-keras/
  #parallel_model = multi_gpu_model(model, gpus=2)
  # include_top=True ends in a softmax layer, so the loss takes probabilities (from_logits=False)
  loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
  # https://keras.io/api/models/model_training_apis/
  parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
# batch_size is the global batch, split across the replicas; other sizes tried: 5120, 7168
parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)
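
The warning quoted in the script ("Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow") appears when the hard-coded device list does not match what the runtime actually sees. A minimal sketch of building the device list from the visible GPUs instead, assuming TensorFlow 2.x; `gpu_device_strings` and `make_strategy` are hypothetical helper names, not part of the original script:

```python
def gpu_device_strings(count):
    """Build the "/gpu:N" names that MirroredStrategy accepts."""
    return [f"/gpu:{i}" for i in range(count)]


def make_strategy():
    """Mirror across all visible GPUs, else fall back to the default strategy.

    Assumes TensorFlow 2.x; the import is kept inside the function so the
    helper above can be used without TensorFlow installed.
    """
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if len(gpus) >= 2:
        return tf.distribute.MirroredStrategy(devices=gpu_device_strings(len(gpus)))
    # Zero or one GPU: the default (no-op) strategy keeps strategy.scope() valid.
    return tf.distribute.get_strategy()


# Usage (needs TensorFlow):
#   strategy = make_strategy()
#   print("Number of devices:", strategy.num_replicas_in_sync)
```

With this in place the rest of the script can stay unchanged, since `strategy.scope()` works for both return values.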

running


michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows (main)
$ docker build -t ml-tensorflow .

michael@13900b MINGW64 /c/wse_github/obrienlabsdev/machine-learning/environments/windows (main)
$ docker run --rm --gpus all ml-tensorflow

Your kernel may have been built without NUMA support.
2023-11-25 03:52:20.182632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21286 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-11-25 03:52:20.183542: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-11-25 03:52:20.183562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21286 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:02:00.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 9s 0us/step
Epoch 1/10
2023-11-25 03:52:49.814023: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
2023-11-25 03:52:50.691127: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
2023-11-25 03:52:52.932230: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7efed0f30960 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-25 03:52:52.932256: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-11-25 03:52:52.932259: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-11-25 03:52:52.936202: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-25 03:52:52.993445: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
196/196 [==============================] - 37s 55ms/step - loss: 4.2884 - accuracy: 0.0926
Epoch 2/10
196/196 [==============================] - 10s 50ms/step - loss: 3.9655 - accuracy: 0.1563
Epoch 3/10
196/196 [==============================] - 10s 50ms/step - loss: 4.1099 - accuracy: 0.1460
Epoch 4/10
196/196 [==============================] - 10s 49ms/step - loss: 3.6472 - accuracy: 0.1830
Epoch 5/10
196/196 [==============================] - 10s 49ms/step - loss: 3.5968 - accuracy: 0.1977
Epoch 6/10
196/196 [==============================] - 10s 49ms/step - loss: 3.3228 - accuracy: 0.2375
Epoch 7/10
196/196 [==============================] - 10s 49ms/step - loss: 3.1739 - accuracy: 0.2594
Epoch 8/10
196/196 [==============================] - 10s 50ms/step - loss: 3.1227 - accuracy: 0.2679
Epoch 9/10
196/196 [==============================] - 10s 51ms/step - loss: 3.1216 - accuracy: 0.2696
Epoch 10/10
196/196 [==============================] - 10s 49ms/step - loss: 2.8962 - accuracy: 0.2973
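
To sanity-check the numbers in this log, here is a small sketch that parses the Keras end-of-epoch lines and derives steps per epoch and throughput. It assumes the CIFAR-100 training set of 50,000 images, the global batch of 256 from the script, and 2 replicas; `parse_epoch_lines` and `images_per_second` are hypothetical helpers written for this note:

```python
import math
import re

# Matches end-of-epoch lines such as
# "196/196 [=====] - 10s 49ms/step - loss: 2.8962 - accuracy: 0.2973"
EPOCH_LINE = re.compile(
    r"(\d+)/(\d+) \[[=>.]*\] - (\d+)s (\d+)(ms|us)/step"
    r" - loss: ([\d.]+) - accuracy: ([\d.]+)"
)


def parse_epoch_lines(log_text):
    """Return one dict per completed epoch found in log_text."""
    epochs = []
    for m in EPOCH_LINE.finditer(log_text):
        done, total, _secs, step, unit, loss, acc = m.groups()
        if done != total:
            continue  # skip in-progress lines
        step_ms = int(step) if unit == "ms" else int(step) / 1000.0
        epochs.append({
            "steps": int(total),
            "step_ms": step_ms,
            "loss": float(loss),
            "accuracy": float(acc),
        })
    return epochs


def images_per_second(global_batch, step_ms):
    """Throughput implied by one step of global_batch images taking step_ms."""
    return global_batch / (step_ms / 1000.0)


sample = """\
196/196 [==============================] - 37s 55ms/step - loss: 4.2884 - accuracy: 0.0926
196/196 [==============================] - 10s 50ms/step - loss: 3.9655 - accuracy: 0.1563
196/196 [==============================] - 10s 49ms/step - loss: 2.8962 - accuracy: 0.2973
"""
epochs = parse_epoch_lines(sample)

expected_steps = math.ceil(50_000 / 256)  # 196, matches the "196/196" in the log
per_replica_batch = 256 // 2              # 128 images per GPU per step
print(expected_steps, per_replica_batch)
print(round(images_per_second(256, epochs[-1]["step_ms"])))  # about 5224 images/s at 49 ms/step
```

The first epoch (55ms/step, 37s) includes cuDNN loading and XLA compilation, so the later epochs are the better basis for comparing machines.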
(base)