apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Training speed ~5x slower than PlaidML with AMD Radeon Pro 5500M #110

Open dmckinno opened 3 years ago

dmckinno commented 3 years ago

Training 5 epochs of the network below on the tf.keras.datasets.mnist dataset takes ~5x longer with ML Compute than with PlaidML. Is this expected behavior?

Note that both of these are significantly faster than CPU training, but PlaidML seems to do a much better job with acceleration. Are there ML Compute-specific considerations that I need to keep in mind?

Model

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 4096)              3215360   
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dense_2 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dense_3 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dense_4 (Dense)              (None, 10)                40970     
=================================================================
Total params: 53,600,266
Trainable params: 53,600,266
Non-trainable params: 0

PlaidML

Epoch 1/5
60000/60000 [==============================] - 22s 364us/step - loss: 0.6628 - acc: 0.8160
Epoch 2/5
60000/60000 [==============================] - 20s 337us/step - loss: 0.0871 - acc: 0.9739
Epoch 3/5
60000/60000 [==============================] - 20s 338us/step - loss: 0.0528 - acc: 0.9832
Epoch 4/5
60000/60000 [==============================] - 20s 338us/step - loss: 0.0328 - acc: 0.9898
Epoch 5/5
60000/60000 [==============================] - 20s 338us/step - loss: 0.0273 - acc: 0.9914
CPU times: user 3.93 s, sys: 3.64 s, total: 7.56 s
Wall time: 1min 43s

ML Compute

Epoch 1/5
59/59 [==============================] - 131s 2s/step - loss: 1.4286 - accuracy: 0.6370
Epoch 2/5
59/59 [==============================] - 133s 2s/step - loss: 0.0905 - accuracy: 0.9729
Epoch 3/5
59/59 [==============================] - 129s 2s/step - loss: 0.0482 - accuracy: 0.9855
Epoch 4/5
59/59 [==============================] - 135s 2s/step - loss: 0.0322 - accuracy: 0.9899
Epoch 5/5
59/59 [==============================] - 141s 2s/step - loss: 0.0239 - accuracy: 0.9925
CPU times: user 20.5 s, sys: 9.1 s, total: 29.6 s
Wall time: 11min 10s
leedrake5 commented 3 years ago

I’m seeing the same thing - GPU use is also quite low, with observable gaps. I’m not sure what they’re going for with this.

anna-tikhonova commented 3 years ago

Thank you very much for reporting this. Could you provide a reproducible test case, so we know exactly what you are running and can investigate locally?

There is an optional mlcompute.set_mlc_device(device_name='any') API for ML Compute device selection. The default value for device_name is 'any', which means ML Compute will select the best available device on your system, including multiple GPUs on multi-GPU configurations. Could you try running with 'cpu' and 'gpu' and let us know what you see? Thank you!
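For reference, a minimal sketch of that device-selection call (the import path is the same one used in the scripts later in this thread; 'cpu', 'gpu', and 'any' are the documented options):

from tensorflow.python.compiler.mlcompute import mlcompute

# Select the ML Compute device before building the model.
# 'any' (the default) lets ML Compute pick the best available device;
# 'cpu' and 'gpu' pin it explicitly.
mlcompute.set_mlc_device(device_name='gpu')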

dmckinno commented 3 years ago

Sure. Code below.

When I set mlcompute.set_mlc_device(device_name='cpu'), the wall time increased to almost 17 minutes (see below). ML Compute is clearly accelerating something on the GPU, but it is doing so much less efficiently than PlaidML.

You can see that it is using the Radeon rather than the Intel GPU in the attached Activity Monitor screenshots (device_name='gpu' above and device_name='cpu' below).

[Activity Monitor screenshots attached: device_name='gpu' and device_name='cpu' runs]

ML Compute (device_name='cpu')
Epoch 1/5
59/59 [==============================] - 203s 3s/step - loss: 1.2530 - accuracy: 0.6592
Epoch 2/5
59/59 [==============================] - 197s 3s/step - loss: 0.0908 - accuracy: 0.9719
Epoch 3/5
59/59 [==============================] - 205s 3s/step - loss: 0.0530 - accuracy: 0.9833
Epoch 4/5
59/59 [==============================] - 207s 4s/step - loss: 0.0306 - accuracy: 0.9903
Epoch 5/5
59/59 [==============================] - 205s 3s/step - loss: 0.0247 - accuracy: 0.9921
CPU times: user 2h 6min 46s, sys: 38.4 s, total: 2h 7min 25s
Wall time: 16min 58s
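As an aside, the CPU times / Wall time lines above look like Jupyter's %%time output; for comparing backends outside a notebook, a plain time.perf_counter wrapper works as well. A minimal sketch, assuming the same model and data as in the scripts below:

import time

start = time.perf_counter()
model.fit(x_train, y_train, epochs=5, batch_size=1024)
print(f"Wall time for 5 epochs: {time.perf_counter() - start:.1f} s")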

PlaidML

import numpy as np
import os

os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

import keras

mnist = keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28, 1)),
  keras.layers.Dense(4096,activation='relu'),
  keras.layers.Dense(4096,activation='relu'),
  keras.layers.Dense(4096,activation='relu'),
  keras.layers.Dense(4096,activation='relu'),
  keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'],)

model.fit(np.expand_dims(x_train,3), y_train, epochs=5, batch_size=1024)

ML Compute

import tensorflow as tf

from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name='gpu')  # Available options are 'cpu', 'gpu', and 'any'.

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(4096,activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])

# The final layer applies softmax, so the loss takes probabilities rather than logits
# (this also matches the PlaidML script above, which uses the default from_logits=False).
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'],)

model.fit(x_train, y_train, epochs=5, batch_size=1024)
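One variable worth isolating when comparing against PlaidML (which compiles a static graph rather than running eagerly) is TensorFlow's eager execution. A sketch of forcing graph mode before device selection, using the same calls as the CNN script further down in this thread:

import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
from tensorflow.python.compiler.mlcompute import mlcompute

# Switch to graph mode before any layers or ops are created.
tf.config.run_functions_eagerly(False)
disable_eager_execution()

mlcompute.set_mlc_device(device_name='gpu')
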
dmckinno commented 3 years ago

@anna-tikhonova, any resolution here? Would love to begin porting some code from PlaidML to ML Compute.

Aarsh2001 commented 3 years ago

I have the same issue. This is a script I ran on ML Compute:

import tensorflow as tf
tf.config.run_functions_eagerly(False)
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name='gpu')

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import TensorBoard
import pickle
import warnings
import time

warnings.filterwarnings("ignore")
NAME = "Cats-vs-Dogs-64*2-{}".format(int(time.time()))
tensorboard = TensorBoard(log_dir=f'logs/{NAME}')

X = pickle.load(open("X.pickle", "rb"))
y = pickle.load(open("y.pickle", "rb"))
X = X / 255.0

model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=X.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolution output is 2D whereas the dense layer needs 1D, so flatten;
# probably don't need this for this dataset.
model.add(Flatten())

model.add(Dense(64))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=50, validation_split=0.1, epochs=10, callbacks=[tensorboard])

Output:

Train on 22451 samples, validate on 2495 samples
Epoch 1/10
22451/22451 [==============================] - 84s 4ms/sample - loss: 0.6625 - accuracy: 0.6123 - val_loss: 0.6311 - val_accuracy: 0.6653

On PlaidML, the ETA for the same run was 64s. It seems ML Compute doesn't utilise the GPU to its full extent. Does ML Compute have an option, like PlaidML's setup step, to select a specific GPU? @anna-tikhonova