apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Slow GPU performance on Radeon Pro 460 #7

Open FelixGoetze opened 3 years ago

FelixGoetze commented 3 years ago

I have tested this fork on a simple MNIST example: https://github.com/tensorflow/datasets/blob/master/docs/keras_example.ipynb. By default the model runs on the CPU and takes 3 ms per step. If I switch to the GPU using mlcompute.set_mlc_device(device_name='gpu'), each step takes around 12 ms.

I am running macOS Big Sur 11.0.1 on a 2016 15-inch MacBook Pro with a Radeon Pro 460 4 GB. Is it expected that the GPU will run much slower than the CPU?
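For reference, the device switch described above looks like this (a minimal sketch using the mlcompute API from this fork, placed before the model is built):

from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name='gpu')  # or 'cpu' to compare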

tranbach commented 3 years ago

Same for me using the latest MacBook Pro 16. I trained a couple of epochs of VGG19: the GPU version takes 49 seconds, the CPU version takes 7 seconds, TensorFlow 2.3.1 takes 6 seconds, and PlaidML takes 2 seconds. I thought AMD GPUs were supported through Metal, as in PlaidML...

I get the following messages from tensorflow:

2020-11-18 20:29:50.600229: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-18 20:29:52.421434: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)

nehbit commented 3 years ago

I can confirm this is also the case for two AMD GPUs I've tested on different machines. Both were much, much slower than running the same thing on the CPU.

tzm41 commented 3 years ago

Same. Testing MNIST with a sample CNN on my MBP 16 2019 with an AMD Radeon Pro 5500M, it seems to get stuck between batches.

sevenold commented 3 years ago

Same, testing MNIST with a sample CNN on my MBP 16 2019 with an AMD Radeon Pro 5300M:

WARNING:tensorflow:Eager mode on GPU is extremely slow. Consider to use CPU instead

JL1829 commented 3 years ago

Same, I get the two warnings below:

  1. WARNING:tensorflow:Eager mode on GPU is extremely slow. Consider to use CPU instead
  2. 2020-11-19 14:42:26.766401: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)

And I can see from Activity Monitor that the GPU was not being utilised.

qixiang109 commented 3 years ago

Same, I get the two warnings below:

  1. WARNING:tensorflow:Eager mode on GPU is extremely slow. Consider to use CPU instead
  2. 2020-11-19 14:42:26.766401: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)

And I can see from Activity Monitor that the GPU was not being utilised.

Your warning says that "Eager mode on GPU is extremely slow"; try tf.compat.v1.disable_eager_execution().
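A minimal sketch of this suggestion, combined with the device selection used elsewhere in this thread (assumed to go at the top of the script, before the model is built):

import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()       # run in graph mode; the warning above is about eager mode on GPU
mlcompute.set_mlc_device(device_name='gpu')  # route ops to the AMD GPU via ML Compute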

JL1829 commented 3 years ago

Your warning says that "Eager mode on GPU is extremely slow"; try tf.compat.v1.disable_eager_execution().

Tried, same

nehbit commented 3 years ago

I can confirm this is also the case for two AMD GPUs I've tested on different machines. Both were much, much slower than running the same thing on the CPU.

I spent some time this evening properly setting up an Anaconda environment with this fork in it and fed it a nontrivial task. I can confirm that it is indeed giving me about a 3x speed boost at around 60% GPU utilisation. I suspect a larger task would get closer to 100% utilisation, which would give us the expected ~5x speedup over the CPU. So in my set-up at least, it is now indeed working correctly. Just make sure you add these two lines at the beginning of the file you're running:

from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name = 'gpu')

And ignore the fact that TF still tells you that there is no GPU present even after these lines.

Here's the GPU vs CPU comparison I've got on a MNIST simple image classification task from the official TF2 models library (link):

CPU:

58/58 [==============================] - 44s 731ms/step - loss: 2.2478 - sparse_categorical_accuracy: 0.2123 - val_loss: 1.6934 - val_sparse_categorical_accuracy: 0.7380
Epoch 2/10
58/58 [==============================] - 41s 712ms/step - loss: 1.3346 - sparse_categorical_accuracy: 0.6389 - val_loss: 0.5675 - val_sparse_categorical_accuracy: 0.8169
Epoch 3/10
58/58 [==============================] - 42s 722ms/step - loss: 0.6407 - sparse_categorical_accuracy: 0.7925 - val_loss: 0.3464 - val_sparse_categorical_accuracy: 0.9036
Epoch 4/10
58/58 [==============================] - 42s 719ms/step - loss: 0.4668 - sparse_categorical_accuracy: 0.8519 - val_loss: 0.3279 - val_sparse_categorical_accuracy: 0.8989
Epoch 5/10
58/58 [==============================] - 42s 728ms/step - loss: 0.4090 - sparse_categorical_accuracy: 0.8706 - val_loss: 0.2688 - val_sparse_categorical_accuracy: 0.9206
Epoch 6/10
58/58 [==============================] - 43s 739ms/step - loss: 0.3439 - sparse_categorical_accuracy: 0.8930 - val_loss: 0.2169 - val_sparse_categorical_accuracy: 0.9355
Epoch 7/10
58/58 [==============================] - 42s 727ms/step - loss: 0.3048 - sparse_categorical_accuracy: 0.9069 - val_loss: 0.1968 - val_sparse_categorical_accuracy: 0.9423
Epoch 8/10
58/58 [==============================] - 43s 744ms/step - loss: 0.2650 - sparse_categorical_accuracy: 0.9180 - val_loss: 0.2029 - val_sparse_categorical_accuracy: 0.9393
Epoch 9/10
58/58 [==============================] - 42s 733ms/step - loss: 0.2947 - sparse_categorical_accuracy: 0.9077 - val_loss: 0.1733 - val_sparse_categorical_accuracy: 0.9486
Epoch 10/10
58/58 [==============================] - 43s 746ms/step - loss: 0.2352 - sparse_categorical_accuracy: 0.9256 - val_loss: 0.1637 - val_sparse_categorical_accuracy: 0.9484

GPU: (AMD Radeon RX Vega 64, 8GB)

58/58 [==============================] - 21s 278ms/step - loss: 2.0568 - sparse_categorical_accuracy: 0.2967 - val_loss: 0.5769 - val_sparse_categorical_accuracy: 0.8364
Epoch 2/10
58/58 [==============================] - 15s 258ms/step - loss: 0.5700 - sparse_categorical_accuracy: 0.8216 - val_loss: 0.2908 - val_sparse_categorical_accuracy: 0.9163
Epoch 3/10
58/58 [==============================] - 15s 254ms/step - loss: 0.3343 - sparse_categorical_accuracy: 0.9014 - val_loss: 0.2121 - val_sparse_categorical_accuracy: 0.9417
Epoch 4/10
58/58 [==============================] - 15s 255ms/step - loss: 0.2452 - sparse_categorical_accuracy: 0.9273 - val_loss: 0.1726 - val_sparse_categorical_accuracy: 0.9486
Epoch 5/10
58/58 [==============================] - 15s 254ms/step - loss: 0.2032 - sparse_categorical_accuracy: 0.9394 - val_loss: 0.1475 - val_sparse_categorical_accuracy: 0.9572
Epoch 6/10
58/58 [==============================] - 15s 253ms/step - loss: 0.1784 - sparse_categorical_accuracy: 0.9468 - val_loss: 0.1266 - val_sparse_categorical_accuracy: 0.9625
Epoch 7/10
58/58 [==============================] - 15s 255ms/step - loss: 0.1600 - sparse_categorical_accuracy: 0.9515 - val_loss: 0.1157 - val_sparse_categorical_accuracy: 0.9659
Epoch 8/10
58/58 [==============================] - 15s 253ms/step - loss: 0.1431 - sparse_categorical_accuracy: 0.9573 - val_loss: 0.1016 - val_sparse_categorical_accuracy: 0.9679
Epoch 9/10
58/58 [==============================] - 15s 254ms/step - loss: 0.1322 - sparse_categorical_accuracy: 0.9603 - val_loss: 0.0919 - val_sparse_categorical_accuracy: 0.9715
Epoch 10/10
58/58 [==============================] - 15s 258ms/step - loss: 0.1181 - sparse_categorical_accuracy: 0.9650 - val_loss: 0.0827 - val_sparse_categorical_accuracy: 0.9750

As an aside, I find it funny that Apple managed to do this before AMD did for AMD's own graphics cards: right now, the only way to use an AMD card in a real ML (TensorFlow) workflow is to stick it into a Mac, since AMD's own effort, ROCm, is still fairly unfinished. Interesting times!

Major kudos to whichever engineering team inside Apple managed to pull this off.

sevenold commented 3 years ago

MBP 16 2019

code:

#!/usr/bin/env python
# coding: utf-8
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds

tf.enable_v2_behavior()

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name='gpu')

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(128,activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy'],
)

model.fit(
    ds_train,
    epochs=10,
    validation_data=ds_test,

)

GPU: (AMD Radeon Pro 5300M 4 GB)

environment: [screenshot]

logs: [screenshot]

GPU: [screenshot]

CPU (2.6 GHz 6-core Intel Core i7)

code: (the same script as above, with the mlcompute device selection commented out)

...
# from tensorflow.python.compiler.mlcompute import mlcompute
# mlcompute.set_mlc_device(device_name='gpu')
...

environment: [screenshot]

logs: [screenshot]

GPU: [screenshot]

nehbit commented 3 years ago

Interesting, I ran your code on both GPU and CPU and my results are similar: in your task, the CPU is faster. That said, if I had to make a completely uneducated guess, I'd say your task is small enough that moving data from the CPU to the GPU takes the lion's share of the time, and the GPU spends most of its time waiting. On my task the GPU power consumption is around 300 W; on yours it barely goes above idle at 50 W.

This is the model I used on my test, but I suspect even this is too fast per epoch. Might want to give it a shot: https://github.com/tensorflow/models/tree/master/official/vision/image_classification

This is what I got with your code:

GPU:

469/469 [==============================] - 7s 11ms/step - loss: 0.6087 - accuracy: 0.8330 - val_loss: 0.1977 - val_accuracy: 0.9423
Epoch 2/10
469/469 [==============================] - 4s 9ms/step - loss: 0.1781 - accuracy: 0.9501 - val_loss: 0.1355 - val_accuracy: 0.9593
Epoch 3/10
469/469 [==============================] - 4s 9ms/step - loss: 0.1220 - accuracy: 0.9647 - val_loss: 0.1126 - val_accuracy: 0.9680
Epoch 4/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0943 - accuracy: 0.9739 - val_loss: 0.0924 - val_accuracy: 0.9724
Epoch 5/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0779 - accuracy: 0.9771 - val_loss: 0.0833 - val_accuracy: 0.9744
Epoch 6/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0603 - accuracy: 0.9835 - val_loss: 0.0794 - val_accuracy: 0.9756
Epoch 7/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0495 - accuracy: 0.9859 - val_loss: 0.0743 - val_accuracy: 0.9771
Epoch 8/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0424 - accuracy: 0.9883 - val_loss: 0.0687 - val_accuracy: 0.9790
Epoch 9/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0359 - accuracy: 0.9899 - val_loss: 0.0713 - val_accuracy: 0.9779
Epoch 10/10
469/469 [==============================] - 4s 9ms/step - loss: 0.0289 - accuracy: 0.9924 - val_loss: 0.0707 - val_accuracy: 0.9777

CPU:

469/469 [==============================] - 4s 4ms/step - loss: 0.6033 - accuracy: 0.8359 - val_loss: 0.1923 - val_accuracy: 0.9457
Epoch 2/10
469/469 [==============================] - 1s 2ms/step - loss: 0.1792 - accuracy: 0.9499 - val_loss: 0.1379 - val_accuracy: 0.9605
Epoch 3/10
469/469 [==============================] - 1s 2ms/step - loss: 0.1238 - accuracy: 0.9645 - val_loss: 0.1093 - val_accuracy: 0.9671
Epoch 4/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0929 - accuracy: 0.9737 - val_loss: 0.0967 - val_accuracy: 0.9707
Epoch 5/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0748 - accuracy: 0.9778 - val_loss: 0.0845 - val_accuracy: 0.9738
Epoch 6/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0608 - accuracy: 0.9825 - val_loss: 0.0764 - val_accuracy: 0.9769
Epoch 7/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0528 - accuracy: 0.9853 - val_loss: 0.0768 - val_accuracy: 0.9764
Epoch 8/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0428 - accuracy: 0.9886 - val_loss: 0.0849 - val_accuracy: 0.9722
Epoch 9/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0350 - accuracy: 0.9905 - val_loss: 0.0752 - val_accuracy: 0.9771
Epoch 10/10
469/469 [==============================] - 1s 2ms/step - loss: 0.0297 - accuracy: 0.9923 - val_loss: 0.0730 - val_accuracy: 0.9787

bryanlimy commented 3 years ago

@sevenold I don't think we have to disable eager execution. Also, 2ms/step on CPU might suggest that the task is too small and the overhead of transferring data between GPU and CPU is larger than the speedup you would get from a GPU. Maybe try setting up a larger network?

EDIT: I think this model is way too small to see any benefits from the GPU

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(128,activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])

@nehbit do you mind sharing your steps to install tensorflow_macos with conda instead of virtualenv?

nehbit commented 3 years ago

@bryanlimy It wasn't anything special, but here it goes:

That's pretty much it. I did this after spending a good hour trying to get scikit-learn installed in the virtualenv created by the script in this repo, which is needed to run anything nontrivial.

sevenold commented 3 years ago

After increasing the model parameters, the GPU is faster than the CPU. CPU vs GPU: [screenshot] @nehbit

dkgaraujo commented 3 years ago

Hi! For R users, I created benchmark code to compare tf-mac on CPU or GPU, as well as with GPU-accelerated PlaidML. You can find the code here: https://github.com/dkgaraujo/TensorflowMacOSBenchmark

chandc commented 3 years ago

I have been able to reproduce the MNIST CNN runtime results on an MBP 16-inch, 2019 (32 GB RAM, AMD Radeon Pro 5500M 8 GB), running Big Sur 11.0.1 with Python 3.8.6.

CPU = 132 s, GPU = 105 s, Colab = 23 s

Outputs are provided below:

Train: X=(60000, 28, 28), y=(60000,)
Test: X=(10000, 28, 28), y=(10000,)

Model: "sequential"
Layer (type)                 Output Shape              Param #
conv2d (Conv2D)              (None, 26, 26, 32)        320
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
conv2d_1 (Conv2D)            (None, 12, 12, 32)        4128
max_pooling2d_1 (MaxPooling2D) (None, 6, 6, 32)        0
flatten (Flatten)            (None, 1152)              0
dense (Dense)                (None, 500)               576500
dense_1 (Dense)              (None, 10)                5010

Total params: 585,958
Trainable params: 585,958
Non-trainable params: 0
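For reference, a Keras model matching this summary could be built roughly as follows (a sketch inferred from the output shapes and parameter counts above; the actual training script is not shown in this comment):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # -> (26, 26, 32), 320 params
    tf.keras.layers.MaxPooling2D((2, 2)),                                            # -> (13, 13, 32)
    tf.keras.layers.Conv2D(32, (2, 2), activation='relu'),                           # -> (12, 12, 32), 4,128 params
    tf.keras.layers.MaxPooling2D((2, 2)),                                            # -> (6, 6, 32)
    tf.keras.layers.Flatten(),                                                       # -> 1152
    tf.keras.layers.Dense(500, activation='relu'),                                   # 576,500 params
    tf.keras.layers.Dense(10, activation='softmax'),                                 # 5,010 params
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])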

CPU
Epoch 1/10
469/469 [==============================] - 13s 27ms/step - loss: 0.3570 - accuracy: 0.8916
Epoch 2/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0474 - accuracy: 0.9847
Epoch 3/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0279 - accuracy: 0.9914
Epoch 4/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0203 - accuracy: 0.9940
Epoch 5/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0142 - accuracy: 0.9954
Epoch 6/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0111 - accuracy: 0.9964
Epoch 7/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0099 - accuracy: 0.9967
Epoch 8/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0073 - accuracy: 0.9977
Epoch 9/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0077 - accuracy: 0.9976
Epoch 10/10
469/469 [==============================] - 13s 28ms/step - loss: 0.0054 - accuracy: 0.9983
Time: 132.01292276382446
Accuracy: 0.989

GPU with eager execution disabled
Epoch 1/10
60000/60000 [==============================] - 10s 172us/sample - loss: 0.1522 - accuracy: 0.9535
Epoch 2/10
60000/60000 [==============================] - 10s 171us/sample - loss: 0.0465 - accuracy: 0.9854
Epoch 3/10
60000/60000 [==============================] - 10s 169us/sample - loss: 0.0311 - accuracy: 0.9906
Epoch 4/10
60000/60000 [==============================] - 10s 169us/sample - loss: 0.0213 - accuracy: 0.9931
Epoch 5/10
60000/60000 [==============================] - 10s 170us/sample - loss: 0.0157 - accuracy: 0.9950
Epoch 6/10
60000/60000 [==============================] - 10s 174us/sample - loss: 0.0117 - accuracy: 0.9963
Epoch 7/10
60000/60000 [==============================] - 11s 184us/sample - loss: 0.0093 - accuracy: 0.9969
Epoch 8/10
60000/60000 [==============================] - 11s 179us/sample - loss: 0.0087 - accuracy: 0.9972
Epoch 9/10
60000/60000 [==============================] - 11s 176us/sample - loss: 0.0076 - accuracy: 0.9974
Epoch 10/10
60000/60000 [==============================] - 11s 176us/sample - loss: 0.0059 - accuracy: 0.9981
Time: 104.61961388587952

Colab
Epoch 1/10
469/469 [==============================] - 2s 4ms/step - loss: 0.1515 - accuracy: 0.9537
Epoch 2/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0464 - accuracy: 0.9855
Epoch 3/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0307 - accuracy: 0.9901
Epoch 4/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0215 - accuracy: 0.9929
Epoch 5/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0155 - accuracy: 0.9950
Epoch 6/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0127 - accuracy: 0.9961
Epoch 7/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0096 - accuracy: 0.9967
Epoch 8/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0071 - accuracy: 0.9977
Epoch 9/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0079 - accuracy: 0.9973
Epoch 10/10
469/469 [==============================] - 2s 3ms/step - loss: 0.0062 - accuracy: 0.9977
Time: 23.20409369468689
Accuracy: 0.990

qixiang109 commented 3 years ago

I can confirm this is also the case for two AMD GPUs I've tested on different machines. [...] So in my set-up at least, it is now indeed working correctly. [...]

Can you try a larger model such as ResNet50? And by the way, which AMD card do you use?
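One way to try that (a hypothetical benchmark sketch: random stand-in data, placeholder shapes and batch size, not code from this thread):

import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name='gpu')  # select the ML Compute GPU before building the model

# ResNet50 trained from scratch on random data, purely to measure step time.
model = tf.keras.applications.ResNet50(weights=None, input_shape=(224, 224, 3), classes=10)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

x = np.random.rand(512, 224, 224, 3).astype('float32')
y = np.random.randint(0, 10, size=(512,))
model.fit(x, y, batch_size=32, epochs=1)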

lostmsu commented 3 years ago

To everyone in this thread: the MNIST dataset and the model used in the example are too small to benefit from the GPU, because the cost of juggling data between the CPU and the GPU outweighs any potential performance gains.

To see a real difference (if any) you can:

  1. Increase number of layers.
  2. Increase layer widths (e.g. for dense layers - number of units, for convolutions - number of channels).
  3. Increase batch size to 64 (typical for many training tasks).

Try replacing the last cell with

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy'],
)

model.fit(
    ds_train,        # batch size comes from ds_train.batch(...) in the pipeline above;
                     # passing batch_size= to fit() with a tf.data.Dataset raises an error
    epochs=6,
    validation_data=ds_test,
)

P.S. I don't have a compatible Mac myself, but interested in the results from various machines.

WARNING: NETWORK INCREASED FURTHER FOR THIS RESULT, UPDATE YOUR CODE:

With these settings I get about 10ms/step on my Titan V and ~87% GPU load according to nvidia-smi.

Akshaysehgal2005 commented 3 years ago

To everyone in this thread: the MNIST dataset and the model used in the example are too small to benefit from the GPU [...] With these settings I get about 10ms/step on my Titan V and ~87% GPU load according to nvidia-smi.

Sadly, I still couldn't replicate the results you mention above. I used around 53M parameters and still see a drastic difference between CPU and GPU speeds (the CPU being much faster). I do see GPU utilisation in Activity Monitor when the device is set to GPU, though. Maybe it's because of the 'compatible Macs' you mention? Is there a list of such devices? From what I saw, the requirements only mention Big Sur 11.

anna-tikhonova commented 3 years ago

To provide an update: VGG19: we've identified the issue and the fix will be in the next update. MNIST: we will investigate and report back.

giordan12 commented 3 years ago

Related to this issue, has anyone else attempted to use a tf.data.Dataset? I am using one and my training times using the GPU are much slower than the CPU. I'm registering 797ms using the CPU and 386s using the GPU.

dkgaraujo commented 3 years ago

Related to this issue, has anyone else attempted to use a tf.data.Dataset? I am using one and my training times using the GPU are much slower than the CPU. I'm registering 797ms using the CPU and 386s using the GPU.

Hi @giordan12: it is very likely that the comparatively slow performance you are seeing with the GPU is not related to the specific source of the data (tf.data.Dataset in this case), but to the fact that either your dataset or your batch_size is too small to really benefit from the GPU. Remember that GPUs are massively parallel calculation machines, so as a rough rule, to take advantage of GPU acceleration you want to feed them the largest batches they can handle.

For more discussion and results on this, please see https://github.com/apple/tensorflow_macos/issues/25#issuecomment-731568938 and associated thread.
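As an illustration of the batch-size point (a sketch with synthetic data; the shapes and batch size are placeholders, not values from this thread):

import tensorflow as tf

# Synthetic MNIST-sized data standing in for a real dataset.
images = tf.random.uniform((60000, 28, 28, 1))
labels = tf.random.uniform((60000,), maxval=10, dtype=tf.int32)

ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .shuffle(10_000)
      .batch(512)                                   # larger batches keep the GPU busy
      .prefetch(tf.data.experimental.AUTOTUNE))     # overlap input prep with device compute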

jkleckner commented 3 years ago

@giordan12 It would be interesting to view the gpu usage as done in that other thread.

anna-tikhonova commented 3 years ago

Same for me using the latest MacBook Pro 16. I trained a couple of epochs of VGG19: the GPU version takes 49 seconds, the CPU version takes 7 seconds, TensorFlow 2.3.1 takes 6 seconds, and PlaidML takes 2 seconds. I thought AMD GPUs were supported through Metal, as in PlaidML...

@tranbach Could you tell us what batch size you were using to train VGG19?

tranbach commented 3 years ago

@anna-tikhonova Batch size of 64.

quangtvdevnet commented 3 years ago

Hi,

This add-on runs very slowly on my Mac (2019 MacBook Pro 16-inch, AMD Radeon Pro 5500M 4 GB). The Mac hangs and has to be restarted...

I am running a ResNet network.