apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Model training on CPU (Intel) throws seg fault #19

Open tux-o-matic opened 3 years ago

tux-o-matic commented 3 years ago

First tests using this fork: running model training against the CIFAR-10 dataset as a benchmark. During the first epoch I encounter:

Total params: 309,290
Trainable params: 308,394
Non-trainable params: 896
_________________________________________________________________
Epoch 1/100
5000/5000 [==============================] - ETA: 0s - loss: 2.3416 - accuracy: 0.3065zsh: segmentation fault  python cifar10_cnn.py
multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown

Explicitly setting it to run on the GPU does work, but it is much slower (Intel integrated graphics):

import tensorflow
from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name='gpu')
tensorflow.config.run_functions_eagerly(False)

Python 3.8.6 from MacPorts, if it makes any difference.
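
For completeness, the crashing configuration can be pinned explicitly rather than relying on the default device selection. This is a minimal sketch assuming the mlcompute API also accepts device_name='cpu', mirroring the 'gpu' call above:

import tensorflow
from tensorflow.python.compiler.mlcompute import mlcompute

# Pin ML Compute to the CPU; with this fork the CPU path is what segfaults here.
mlcompute.set_mlc_device(device_name='cpu')
tensorflow.config.run_functions_eagerly(False)
# ...then build and fit the same CIFAR-10 model as in cifar10_cnn.py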

anna-tikhonova commented 3 years ago

@tux-o-matic Thank you for reporting this issue. Could you please point us to or attach the example you are running? That way, we can reproduce the issue locally and investigate.

tux-o-matic commented 3 years ago

Hi @anna-tikhonova, I use this Python code. It just needs the TF fork and NumPy:

python cifar10_cnn.py

In my case, on a MacBook Air with an Intel chip, the backend seems to choose the CPU by default and then throws the error. However, if I specify

import tensorflow
from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name='gpu')
tensorflow.config.run_functions_eagerly(False)

then the model trains, and I can see in Activity Monitor that the Python threads are offloading work to the GPU. But on this integrated Intel GPU the performance is worse than on the CPU, and even PlaidML as a TF backend could do better on the GPU.
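
To put a number on that gap, here is a rough timing sketch (my own, not from the original report): run it twice, once with DEVICE = 'cpu' and once with DEVICE = 'gpu', and compare the wall time per run. It assumes the ML Compute device must be chosen once, before the model is built, and uses random stand-in data so it is self-contained.

import time

import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

DEVICE = 'gpu'  # change to 'cpu' for the second run
mlcompute.set_mlc_device(device_name=DEVICE)
tf.config.run_functions_eagerly(False)

# Small random stand-in for CIFAR-10 images and labels.
x = np.random.rand(2048, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(2048,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

start = time.perf_counter()
model.fit(x, y, epochs=3, batch_size=16)
print(f"{DEVICE}: {time.perf_counter() - start:.1f}s for 3 epochs")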

hughack commented 3 years ago

If you need another example, running this code (from #35) also defaults to the CPU and segfaults.

import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

Machine specs: macOS 11.0.1 on a MacBook Pro (15-inch, 2019); 2.3 GHz 8-core Intel Core i9; 16 GB 2400 MHz DDR4; Radeon Pro 560X 4 GB.

pooyadavoodi commented 3 years ago

@tux-o-matic @hughack I apologize for the late reply. I just tried both of the scripts you provided and I'm not able to reproduce the issue. It's possible that it was resolved in a macOS update. Could you please try again on an updated macOS and let me know if you can still reproduce this?

tux-o-matic commented 3 years ago

Hi @pooyadavoodi. On an up-to-date Big Sur, with Python 3.8.7 and the latest release of this project, I still hit the same error.

pooyadavoodi commented 3 years ago

I managed to reproduce the segfault from @hughack's script using v0.1alpha0, and that issue is resolved in the latest release v0.1alpha2.

@tux-o-matic Could you share the Big Sur version you are using? Also, are you using the Python that comes with the OS? If not, how did you install it?
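
A quick way to collect all of that in one place (a small sketch, not an official diagnostic script):

import platform
import sys

import tensorflow as tf

print("macOS:", platform.mac_ver()[0])   # Big Sur build, e.g. 11.0.1
print("Python:", sys.version)            # interpreter version and build
print("TensorFlow:", tf.__version__)     # version bundled with this fork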

tux-o-matic commented 3 years ago

I'm testing on Big Sur 11.0.1. Python 3.8.7 comes from MacPorts. Earlier tests were on an older point release of Python 3.8, also from MacPorts.

atw1020 commented 3 years ago

This appears to be the same as #127.

I posted over there that this issue seems to be tied to batch size: the segmentation fault occurs with sufficiently large batches. "Sufficiently large" appears to depend on the neural network itself, but every network I have tried so far hits the segfault once the batch size exceeds a certain threshold. It should be possible to reproduce or work around this issue by increasing or decreasing your batch size.

I am still experiencing this on the February alpha build, using the Conda environment described on this page (some of the pip commands need to be updated to match the new file names). I hope this helps you replicate the issue.

Also, using @tux-o-matic's workaround I was able to get my network to stop segfaulting, but it caused a memory leak instead (?!?). It appeared to run faster on the GPU than on the CPU (until I ran out of memory, that is).

tux-o-matic commented 3 years ago

Thanks @atw1020, indeed reducing the batch size in my benchmark allows epochs to complete on the CPU. It's interesting behaviour: I don't expect to be able to use large batch sizes on a laptop with an integrated GPU, but since the memory is shared, it's surprising that TF with ML Compute is so limited on the CPU while the GPU, with the same memory, can handle larger batch sizes. For reference, the original benchmark used a batch size of 32, which worked only on the GPU. Taking it down to 16 works on the CPU (20 is still too high and crashes again).
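
Concretely, the only change needed in a script like @hughack's above is the batch_size argument to model.fit. A sketch of the workaround (the exact threshold will vary by model and machine):

# Workaround sketch: shrink the batch size so training completes on the CPU.
# In this benchmark 32 (the Keras default) segfaulted on the CPU, 16 completed.
history = model.fit(train_images, train_labels,
                    epochs=10,
                    batch_size=16,
                    validation_data=(test_images, test_labels))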

atw1020 commented 3 years ago

I'm seeing non-segfault issues on 0.1-alpha3, but I'm still getting errors that are solved by using a smaller batch size. Going to keep investigating and hopefully come up with some new code to reproduce the issue I'm seeing.

atw1020 commented 3 years ago

I've been trying to replicate this issue on 0.1-alpha3 and haven't been able to, so I'm becoming pretty confident that it was fixed in that patch. There seem to be other bugs related to batch size, but this one has been addressed. Please update this thread if you are still experiencing the issue.