apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

I can't use the GPU on my M1 MacBook Pro #235

Open ElaheHkh opened 3 years ago

ElaheHkh commented 3 years ago

I am trying to use TensorFlow on the new MacBook Pro M1, but I can't find the GPU. I tried to download and install https://github.com/apple/tensorflow_macos/releases both manually and automatically, but it didn't work for me. I'm confused 😔

Screen Shot 1400-01-26 at 04 34 32
GiorgioMannarini commented 3 years ago

Hey! Try to disable eager execution:

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

Then set the device to GPU.
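
For reference, here is a minimal sketch of what setting the device to the GPU looks like with this fork's mlcompute module, pieced together from the snippets later in this thread (it assumes the tensorflow_macos wheel is installed):

import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

# Graph mode is required before pinning the ML Compute device in this fork
tf.compat.v1.disable_eager_execution()

# Ask ML Compute to run on the GPU ('cpu' and 'any' are the other values seen in this thread)
mlcompute.set_mlc_device(device_name='gpu')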

ElaheHkh commented 3 years ago

Hey! Try to disable eager execution:

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

Then set the device to GPU.

I did that, but it doesn't work for me.

Rustam-Z commented 3 years ago

Hello, have you solved this issue? I have the same problem, the GPU is not working on my Mac M1.

Screen Shot 2021-04-15 at 8 17 08 PM
Rustam-Z commented 3 years ago

Here I have compared Google Colab and Mac M1 running time per epoch, both on CPU:

ElaheHkh commented 3 years ago

Here I have compared Google Colab and Mac M1 running time per epoch, both on CPU:

  • Screen Shot 2021-04-15 at 8 55 13 PM
  • Screen Shot 2021-04-15 at 9 00 34 PM

I can run Keras on the GPU, but not torch. You can check GPU usage with Activity Monitor.

Screen Shot 1400-01-26 at 21 26 53
Rustam-Z commented 3 years ago

As you can see, only the CPU is being loaded. @ElaheHkh, how have you enabled the GPU?

ElaheHkh commented 3 years ago

I downloaded tensorflow_macos from https://github.com/apple/tensorflow_macos/releases, moved it to /user/, and then ran these instructions:

% tar xvzf tensorflow_macos-${VERSION}.tar
% cd tensorflow_macos
% ./install_venv.sh --prompt

cd tensorflow_macos
bash install_venv.sh --prompt

conda install -c conda-forge -y absl-py
conda install -c conda-forge -y astunparse
conda install -c conda-forge -y gast
conda install -c conda-forge -y opt_einsum
conda install -c conda-forge -y termcolor
conda install -c conda-forge -y typing_extensions
conda install -c conda-forge -y wheel
conda install -c conda-forge -y typeguard

pip install --upgrade --no-dependencies --force grpcio-1.33.2-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force h5py-2.10.0-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force numpy-1.18.5-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_addons_macos-0.1a3-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_macos-0.1a3-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_addons-0.11.2+mlcompute-cp38-cp38-macosx_11_0_arm64.whl

pip install pyopencl
pip install --upgrade google-api-python-client
pip install absl-py
pip install wrapt
pip install monotonic
pip install netifaces
pip install astunparse
pip install flatbuffers
pip install gast
pip install google_pasta

pip install keras_preprocessing
pip install opt_einsum
pip install protobuf
pip install tensorflow_estimator
pip install termcolor
pip install typing_extensions
pip install wheel
pip install tensorboard
pip install typeguard
pip install tqdm
conda install torchvision -c pytorch
pip install tensorflow_datasets
pip3 install git+https://github.com/geohot/tinygrad.git --upgrade

pip install tensorboard
pip install cython
git clone https://github.com/pandas-dev/pandas.git
cd pandas
python3 setup.py install
pip install ipywidgets
conda update -n base conda
conda install pytorch torchvision -c pytorch
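
A quick way to sanity-check an install like this, borrowing the checks used later in this thread, is a short Python session (assuming the wheels above installed cleanly):

import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

print(tf.__version__)
print("is_apple_mlc_enabled", mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc", mlcompute.is_tf_compiled_with_apple_mlc())
print(tf.config.list_logical_devices())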

ManuelSchneid3r commented 3 years ago

I can run Keras on the GPU, but not torch

So you got tf.keras working on the GPU? Could you please run the following and take a screenshot of the GPU load?

import os
#os.environ["TF_DISABLE_MLC"] = "1"
#os.environ["TF_MLC_LOGGING"] = "1"

from tensorflow.python.compiler.mlcompute import mlcompute
#mlcompute.set_mlc_device(device_name='gpu')
print("is_apple_mlc_enabled %s" % mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc %s" % mlcompute.is_tf_compiled_with_apple_mlc())

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
tf.compat.v1.disable_eager_execution()

print(f"eagerly? {tf.executing_eagerly()}")
print(tf.config.list_logical_devices())

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
ManuelSchneid3r commented 3 years ago

Regarding the device list, I found this post from an Apple contributor.

arge-7 commented 3 years ago

I can run Keras on the GPU, but not torch

So you got tf.keras working on the GPU? Could you please run the following and take a screenshot of the GPU load?

[quoted code from ManuelSchneid3r's comment above omitted]

I'm troubleshooting the same issue and just ran this for fun. On a fresh install of this fork of TF2, my M1 Mac mini uses 21% CPU and 11% GPU.


is_tf_compiled_with_apple_mlc True
eagerly? False
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 132s 1us/step
Train on 50000 samples, validate on 10000 samples
Epoch 1/10
49952/50000 [============================>.] - ETA: 0s - loss: 1.5267 - accuracy: 0.4439
50000/50000 [==============================] - 11s 223us/sample - loss: 1.5263 - accuracy: 0.4440 - val_loss: 1.2340 - val_accuracy: 0.5560
Epoch 2/10
50000/50000 [==============================] - 11s 220us/sample - loss: 1.1626 - accuracy: 0.5901 - val_loss: 1.0764 - val_accuracy: 0.6136
Epoch 3/10
50000/50000 [==============================] - 11s 222us/sample - loss: 1.0211 - accuracy: 0.6416 - val_loss: 1.0145 - val_accuracy: 0.6461
Epoch 4/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.9212 - accuracy: 0.6781 - val_loss: 0.9443 - val_accuracy: 0.6719
Epoch 5/10
50000/50000 [==============================] - 11s 222us/sample - loss: 0.8490 - accuracy: 0.7023 - val_loss: 0.9732 - val_accuracy: 0.6604
Epoch 6/10
50000/50000 [==============================] - 11s 222us/sample - loss: 0.7897 - accuracy: 0.7226 - val_loss: 0.9129 - val_accuracy: 0.6858
Epoch 7/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.7419 - accuracy: 0.7397 - val_loss: 0.9174 - val_accuracy: 0.6886
Epoch 8/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.7016 - accuracy: 0.7530 - val_loss: 0.8932 - val_accuracy: 0.6997
Epoch 9/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.6640 - accuracy: 0.7677 - val_loss: 0.8965 - val_accuracy: 0.6969
Epoch 10/10
50000/50000 [==============================] - 11s 223us/sample - loss: 0.6317 - accuracy: 0.7792 - val_loss: 0.8599 - val_accuracy: 0.7099
ongtw commented 3 years ago

@ManuelSchneid3r Ran your code on my M1 MBA 8/512 and here are the CPU/GPU usages: Screenshot 2021-05-09 at 9 34 15 PM

ongtw commented 3 years ago

@ManuelSchneid3r I edited your code to enable the GPU; here are the charts: Screenshot 2021-05-09 at 9 43 36 PM

CPU per epoch time = 13s, GPU per epoch time = 10s
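
For anyone who wants to log per-epoch times instead of reading them off the progress bar, here is a minimal sketch using a standard Keras callback (EpochTimer is a hypothetical helper name, assuming tf.keras):

import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Prints the wall-clock time of each epoch."""

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch + 1}: {time.time() - self._start:.1f}s")

# usage: model.fit(train_images, train_labels, epochs=10, callbacks=[EpochTimer()])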

arge-7 commented 3 years ago

Can you share that edited code or any other pointers you used to actually get the GPU to run quickly? It seems like you've accomplished what a lot of us have been having trouble with: we either end up 1) using a very low amount of CPU and almost no GPU, or 2) using the GPU entirely but at speeds a few orders of magnitude slower than just using the CPU in eager mode.

When I disable eager mode and set the device to GPU, it would probably take a week to run that code. I'm very curious about what you did differently.

Really appreciate the help, this is huge!

ongtw commented 3 years ago

Here's the code I use to run with GPU:

#import os
#os.environ["TF_DISABLE_MLC"] = "1"
#os.environ["TF_MLC_LOGGING"] = "1"
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()
mlcompute.set_mlc_device(device_name='gpu')
print("is_apple_mlc_enabled %s" % mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc %s" % mlcompute.is_tf_compiled_with_apple_mlc())
print(f"eagerly? {tf.executing_eagerly()}")
print(tf.config.list_logical_devices())

from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

The only thing I did was uncomment the mlcompute set-GPU statement and reorder some lines for readability. The output is shown below:

(m1) $ python tf_m1_test.py 
is_apple_mlc_enabled True
is_tf_compiled_with_apple_mlc True
eagerly? False
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
Train on 50000 samples, validate on 10000 samples
2021-05-09 21:48:18.948717: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-05-09 21:48:18.953797: W tensorflow/core/platform/profile_utils/cpu_utils.cc:126] Failed to get CPU frequency: 0 Hz
Epoch 1/10
49856/50000 [============================>.] - ETA: 0s - loss: 1.5445 - accuracy: 0.4369/Users/dotw/miniforge3/envs/m1/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2325: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.
  warnings.warn('`Model.state_updates` will be removed in a future version. '
50000/50000 [==============================] - 12s 247us/sample - loss: 1.5441 - accuracy: 0.4371 - val_loss: 1.3115 - val_accuracy: 0.5319
Epoch 2/10
50000/50000 [==============================] - 10s 197us/sample - loss: 1.1641 - accuracy: 0.5869 - val_loss: 1.0943 - val_accuracy: 0.6138
Epoch 3/10
50000/50000 [==============================] - 10s 198us/sample - loss: 1.0016 - accuracy: 0.6446 - val_loss: 0.9661 - val_accuracy: 0.6611
Epoch 4/10
50000/50000 [==============================] - 10s 196us/sample - loss: 0.8969 - accuracy: 0.6850 - val_loss: 0.9667 - val_accuracy: 0.6652
Epoch 5/10
50000/50000 [==============================] - 10s 194us/sample - loss: 0.8259 - accuracy: 0.7120 - val_loss: 0.9193 - val_accuracy: 0.6825
Epoch 6/10
50000/50000 [==============================] - 10s 194us/sample - loss: 0.7671 - accuracy: 0.7302 - val_loss: 0.8992 - val_accuracy: 0.6850
Epoch 7/10
50000/50000 [==============================] - 10s 198us/sample - loss: 0.7197 - accuracy: 0.7490 - val_loss: 0.9441 - val_accuracy: 0.6805
Epoch 8/10
50000/50000 [==============================] - 10s 197us/sample - loss: 0.6788 - accuracy: 0.7626 - val_loss: 0.8432 - val_accuracy: 0.7115
Epoch 9/10
50000/50000 [==============================] - 10s 198us/sample - loss: 0.6453 - accuracy: 0.7743 - val_loss: 0.8417 - val_accuracy: 0.7177
Epoch 10/10
50000/50000 [==============================] - 10s 204us/sample - loss: 0.6058 - accuracy: 0.7866 - val_loss: 0.8771 - val_accuracy: 0.7122
DLWCMD commented 3 years ago

I am experiencing similar results with ML Compute, although on an Intel-based MacBook Pro. (See the now-closed issue #256 for background.) After being pointed to this issue (#235), I decided to run ongtw's code above.

The function tf.config.list_logical_devices() reports that the code is running on the CPU [LogicalDevice(name='/device:CPU:0', device_type='CPU')], as does the debugger. However, the macOS Activity Monitor suggests otherwise.

These two images show a "resting state," prior to running ongtw's script.

Screen Shot 2021-05-10 at 11 56 12 AM Screen Shot 2021-05-10 at 11 25 33 AM

The following two images were taken immediately after completing execution of ongtw's script:

Screen Shot 2021-05-10 at 11 31 18 AM Screen Shot 2021-05-10 at 11 31 25 AM

The images show full utilization of the AMD device and elevated usage of Cores 1 and 3 compared to the baseline state. Total execution time was 275 seconds.

I ran a second test under eager execution, which, per the documentation, requires and automatically selects CPU processing. The two images below display the history of this run. GPU usage is similar, but CPU load is higher. Total execution time was 300 seconds.

Screen Shot 2021-05-10 at 12 18 35 PM Screen Shot 2021-05-10 at 12 18 30 PM

My preliminary conclusions are that 1) the GPU is being used in both cases, regardless of the reported device, and 2) selecting the CPU, as in the second run, seems to increase CPU load.

Are my conclusions valid, and, more importantly, is this GPU/CPU usage behavior intended?

Thanks.

ManuelSchneid3r commented 3 years ago

@DLWCMD The GPU missing from the device list is a known issue, already mentioned above. When I referenced this issue, I was not aware that you are using Intel. You said that you "Cannot Set Device" and that "regardless of settings (either 'gpu' or 'any'), the code is run on my CPU". This is why I linked you here. Well, now your GPU seems to work, and you are using Intel…

DLWCMD commented 3 years ago

First, thanks very much for your attention to this issue and your quick responses. Also, I see from your link above that the CPU is shown as the device, even if the GPU is selected and being used. I experience the same behavior on my Intel-based system.

However, as shown in my comment above, when I enable eager execution, which forces the CPU to be selected, the GPU is also engaged. Is this the desired behavior?
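
For the eager run, a standard TensorFlow way to see where operations land is device placement logging; here is a minimal sketch (note this reflects TensorFlow's own placement, which may not show what ML Compute actually dispatches to underneath):

import tensorflow as tf

# Log the device TensorFlow assigns to each op (effective in eager mode)
tf.debugging.set_log_device_placement(True)

print("eager:", tf.executing_eagerly())
print(tf.config.list_logical_devices())

# A trivial op to trigger a placement log line
x = tf.random.uniform((4, 4))
y = tf.matmul(x, x)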

DLWCMD commented 3 years ago

I ran the CNN test script in my (non-ML Compute) Conda environment (TF 2.4.1 / Python 3.8.8) with an execution time comparable to ML Compute (roughly 280 seconds). Activity Monitor confirmed heavy use of all eight cores and no GPU activity.

By contrast, in my ML Compute environment, as shown above, the GPU is fully engaged, but supported by only four cores.

So, in this scenario at least, GPU + four cores is roughly equivalent to eight cores without ML Compute. Would you expect this on my system?

Thanks.

arge-7 commented 3 years ago

The plot thickens on this issue. When I run the code in https://github.com/apple/tensorflow_macos/issues/235#issuecomment-829127007 on my M1 Mac mini, it runs as expected, with full GPU activity in my Activity Monitor. But when I use the same settings, disabling eager execution and specifying the GPU, with a simpler model (a large tabular dataset with fewer layers, all densely connected), I get the behavior many others have noticed: my models train very slowly, using only a small amount of CPU and no GPU. It seems like TF is ignoring the instruction to use the GPU and is just running on the CPU without eager execution, which is very slow. I don't understand why the CNN above runs as expected but my own model doesn't. I would post…

ongtw commented 3 years ago

@arge-7 Interesting, if you don't mind sharing your code, I would be curious to run it on my M1 MBA and see if I get the same effect as you do.

arge-7 commented 3 years ago

The data I'm working with is protected health information, but here's a link to a Jupyter notebook that generates a similar synthetic dataset (10,000 rows, 1,000 binary categorical columns) and a continuous target variable to predict as a regression problem, like the charge for a hospitalization, for example. The notebook also contains code to create and train TF models on this synthetic data. While making this notebook to post here, I found something interesting. There are two models built and trained in the notebook: the first has hidden Dense layers of size 1,000, and the second has hidden layers of size 10,000; the only difference between them is that these two layers differ by an order of magnitude. When I run the code for the first one, my GPU kicks in and it runs as expected. However, with the 10,000-unit layer size, the GPU stays quiet and my CPU tries to handle it while running at only about 20% utilization. Weird.

https://github.com/arge-7/NIS/blob/main/make_dummy_data.ipynb
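
For illustration, a rough sketch of the two model shapes being compared; the layer sizes follow the description above, but this is not the actual notebook code:

import tensorflow as tf
from tensorflow.keras import layers, models

def make_regressor(hidden_units, n_features=1000):
    # Dense regression model on tabular data, as described above
    model = models.Sequential([
        layers.Dense(hidden_units, activation='relu', input_shape=(n_features,)),
        layers.Dense(hidden_units, activation='relu'),
        layers.Dense(1),  # single continuous target (e.g. hospitalization charge)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

small = make_regressor(1000)    # reportedly engages the GPU as expected
large = make_regressor(10000)   # reportedly trains very slowly (later attributed to memory paging)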

Edit: I played around with this more and may have gained a little more insight. In my original code, which I can't share, I had done some feature engineering with sklearn's StandardScaler and PolynomialFeatures. I realized that the order I had them in didn't make sense: I was scaling and then applying the polynomial feature transformation. With only one change to my code, swapping the order so that I added the polynomial features first and then scaled the data, it switched from wimpy CPU to full-power GPU. So it seems like models that are particularly large or complicated, or that have more variable input data, are prone to running on the CPU. It almost seems like a memory issue, although my memory use isn't dramatic in either case. This is with 16 GB of RAM in my Mac mini.
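
Here is the reordering described above, sketched with a scikit-learn pipeline; this is a hypothetical example, not the original feature-engineering code:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.random.rand(1000, 20)  # stand-in for the tabular features

# Original order: scale first, then expand polynomial features
# (the ordering that reportedly trained slowly on the CPU)
scale_then_poly = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2))

# Reordered: expand polynomial features first, then scale
# (the ordering that reportedly switched training to the GPU)
poly_then_scale = make_pipeline(PolynomialFeatures(degree=2), StandardScaler())

X_transformed = poly_then_scale.fit_transform(X)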

ongtw commented 3 years ago

@arge-7 I ran your Jupyter notebook code. Indeed, the second model ran very slowly. Here's why: it is actually using the GPU, but its huge size causes a lot of swapping, which kills the performance. See my CPU/GPU charts below:

cpu_gpu

arge-7 commented 3 years ago

@ongtw Yep, you're right, I can reproduce that. As I decreased the size of the hidden layers in the large model, I reached a point where the GPU started to show more activity. As I continued to gradually decrease the layer sizes, GPU activity increased while CPU activity decreased. My Activity Monitor memory stats don't seem to reflect this, though. Even with a massive model slowed down by the swapping, my monitor shows minimal memory pressure: around 11 GB used, 5 GB free, and around 750 MB of swap used.

I just noticed that this issue is addressed at the bottom of the README, where it is described as paging. I also experimented with the TF_MLC_LOGGING environment variable from the README to compare the outputs of the reasonable vs. huge models, but I didn't see any errors or even any big differences between the outputs in the terminal. Both confirm that ML Compute is using the GPU, even when it doesn't appear to be, due to the memory paging.

I see that the official TF repo has ways to try to limit this. I was going to try using the Apple implementation with TF_MLC_ALLOCATOR_INIT_VALUE and report my results.
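
A minimal sketch of turning that logging on, using the TF_MLC_LOGGING variable already shown (commented out) in the snippets above; the variables have to be set before TensorFlow is imported, and the accepted values for TF_MLC_ALLOCATOR_INIT_VALUE are documented in the README rather than assumed here:

import os

# Enable ML Compute logging (same variable that appears commented out above)
os.environ["TF_MLC_LOGGING"] = "1"

# TF_MLC_ALLOCATOR_INIT_VALUE is the other variable mentioned above; see the README
# for its meaning and accepted values before uncommenting.
# os.environ["TF_MLC_ALLOCATOR_INIT_VALUE"] = "..."

import tensorflow as tf  # import only after the environment is set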

dseddah commented 3 years ago

Hi, just a quick word to let people know, even though it's a bit off topic, that on my MacBook Pro 15'' (2018), TF + ML Compute does actually seem to use the GPU. Benchmarks I've run show that the speed gain is between 2x and 3x (depending on the code) compared to the CPU alone. It's really noticeable in Activity Monitor.

mnist bench on GPU:
Training set contained 60000 images
Testing set contained 10000 images
Model achieved 0.88 testing accuracy
Training and testing took 48.41 seconds

mnist bench on CPU:
Training set contained 60000 images
Testing set contained 10000 images
Model achieved 0.88 testing accuracy
Training and testing took 143.81 seconds

I'm really looking forward to getting a 16'' M1 MacBook Pro when they're out :)

Here's the script I used (with export TF_XLA_FLAGS=--tf_xla_enable_xla_devices): http://pauillac.inria.fr/~seddah/fashin_mnist.py

Djamé

dseddah commented 3 years ago

Update: once the Mac gets too hot, it seems to revert back to the CPU, whose cores are of course now running at 23% of their frequency, and the Mac is barely responding. Can someone tell me if the M1 gets hot when using the GPU?

Djamé

arge-7 commented 3 years ago

@dseddah no it doesn’t. Even with near 100% GPU utilization, my temp stays below 130 F. Fans don’t even turn on usually. That’s the beauty of the M1 though.

dseddah commented 3 years ago

Thanks. That's going to be the main reason I get one. I can't stand those fans anymore. Weirdly, it wasn't as annoying with Mojave, but since I installed Big Sur, everything got weird: inexplicable slowdowns, constant overheating, etc.

arge-7 commented 3 years ago

@dseddah Yeah, I had been using a maxed-out 16-inch MacBook Pro (obviously Intel), and I just couldn't wait any longer to try the M1 chips, so I got a Mac mini to play with. I can't even use the MacBook Pro anymore, just for psychological reasons: it's such a bad feeling using a machine that's four times the cost of the mini, a quarter of the power, hot to the touch, with the fans at full blast. You can't go back to using an Intel machine after using Apple silicon.

There are lots of rumors going around that the next iteration of the chip for the next generation of MacBook Pros is right around the corner.

ongtw commented 3 years ago

@dseddah My M1 MacBook Air 8/512 does not even have a fan. 🙂 My advice is to max out the RAM since there is no way around this if you want to run large models, not even with Apple's Unified Memory Architecture.