On Apple Silicon M1, unexpected very low test accuracy on LSTM

danbricedatascience commented 3 years ago

Hardware : MacBook Air M1 8GB / 512GB

I benchmarked the same very simple LSTM model on MNIST. The loss decreases and the training accuracy increases as expected, but the test accuracy is dramatically low on the Apple Silicon MacBook Air compared to the same model running on 4 different Intel CPU and GPU configurations with TF 2.2 or 2.3 (iMac 27, Google Colab with T4 GPU, Xeon 8-core instances, K80 GPU).

Accuracy on Apple Silicon: 0.097 (while the last training loss was 0.0648 and the training accuracy was 0.9803)
Accuracy on the other configurations: all between 0.9761 and 0.9810

What could explain such extreme overfitting on Apple Silicon only?

On the other hand, replacing the LSTM with a CNN or an MLP works perfectly and gives results consistent with all the other configurations.

import os
import numpy as np
import time

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical

mnist = tf.keras.datasets.mnist

# Load MNIST and scale pixel values to [0, 1]
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# One-hot encode the labels for categorical_crossentropy
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Each 28x28 image is fed to the LSTM as 28 time steps of 28 features
model = tf.keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=train_images.shape[1:]))
model.add(layers.Dense(10, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

print("Start Learning with tensorflow.keras")

start = time.time()

model.fit(train_images,train_labels,epochs=5,batch_size=128)

print("Ran in {} seconds".format(time.time() - start))

test_loss, test_acc = model.evaluate(test_images,test_labels)

print('test_acc:',test_acc)

ac-93 commented 3 years ago

Just a guess, but could this be because the evaluation is being performed on the M1's Neural Engine, which is likely fp16, whereas training is most likely fp32 on the GPU? Perhaps LSTM nets are more sensitive to this change than Conv or MLP nets?
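
One way to get a feel for this guess is to round the values of a toy recurrence to half precision and compare against the full-precision result. The sketch below is a minimal pure-Python illustration, not a model of the M1: `to_fp16` and `recurrent_drift` are hypothetical helper names, the weights and inputs are arbitrary, and nothing here confirms which precision is actually used during evaluation.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def recurrent_drift(inputs, w=0.5, u=0.9):
    """Run h = tanh(w*x + u*h) in double precision and again with the
    weights, inputs, pre-activation and state rounded to fp16.
    Returns the absolute difference between the two final states."""
    h_full = 0.0
    h_half = 0.0
    for x in inputs:
        h_full = math.tanh(w * x + u * h_full)
        pre = to_fp16(to_fp16(w) * to_fp16(x) + to_fp16(u) * h_half)
        h_half = to_fp16(math.tanh(pre))
    return abs(h_full - h_half)


# 28 steps, matching the 28 rows an MNIST image feeds to the LSTM.
# The input values are arbitrary numbers in [0, 1), like scaled pixels.
steps = [((i * 37) % 100) / 100.0 for i in range(28)]
drift = recurrent_drift(steps)
print("fp16 vs. full-precision drift after 28 steps:", drift)
```

A scalar recurrence is of course far simpler than an LSTM cell, so this only shows how one might measure rounding drift, not how large it would be in the real model.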

danbricedatascience commented 3 years ago

@ac-93 Good guess, as it could also explain why the CPU is underutilized even though the overall performance is really good: only 30% slower than a Tesla T4 GPU and 4 times faster than an 8-core Xeon Platinum instance.

When I select the GPU, I clearly get full usage of it, but the picture is unclear when I select the CPU: it is underutilized while being super fast. If we exclude an issue in the CPU monitoring itself, the only explanation I see is that it silently uses a dedicated component of the SoC, like the two ML accelerators or even the Neural Engine.

Maybe Apple specialists can answer this?

jhjiang2020 commented 3 years ago

Same problem. I was training a CNN model, and the training accuracy and val_accuracy were both very high (>97%). However, the test accuracy was unexpectedly low (~25%) when I ran model.evaluate(test_data, test_labels).

Then I tried running model.evaluate(train_data, train_labels), but still got the same low accuracy. Clearly the M1 chip has some issues with the model evaluation process.
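
A possible cross-check is to compute the accuracy by hand from the predicted probabilities and compare it with what `model.evaluate` reports. The sketch below is a minimal pure-Python version; `argmax` and `manual_accuracy` are hypothetical helper names, and it assumes one-hot labels and one row of class probabilities per sample.

```python
def argmax(row):
    """Index of the largest value in a 1-D sequence."""
    best = 0
    for i in range(1, len(row)):
        if row[i] > row[best]:
            best = i
    return best


def manual_accuracy(probabilities, one_hot_labels):
    """Fraction of samples whose predicted class matches the label."""
    if len(probabilities) != len(one_hot_labels):
        raise ValueError("predictions and labels differ in length")
    if len(probabilities) == 0:
        raise ValueError("no samples given")
    hits = sum(
        1 for p, y in zip(probabilities, one_hot_labels)
        if argmax(p) == argmax(y)
    )
    return hits / len(probabilities)


# Tiny hand-made example: 3 samples, 3 classes, the last one is misclassified.
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(manual_accuracy(probs, labels))
```

With a trained Keras model this could be called as `manual_accuracy(model.predict(test_data), test_labels)`. If the hand-computed figure and the `model.evaluate` figure disagree, that would point at the evaluation path rather than at the trained weights, though it would not say why.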

anna-tikhonova commented 3 years ago

@danbricedatascience Thank you for reporting this issue! We will investigate and get back to you.

danbricedatascience commented 3 years ago

I've found some additional information.

This problem disappears when the batch_size passed to evaluate() is greater than or equal to the one used for training.

Example:

model.fit(train_images, train_labels, epochs=5, batch_size=128)

# Returns very low accuracy (< 0.2, while training shows 0.97 on the val set).
# The default batch_size is 32.
test_loss, test_acc = model.evaluate(test_images, test_labels)

# Returns very low accuracy (< 0.2, while training shows 0.97 on the val set).
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=64)

# Returns the expected accuracy: 0.93
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=128)

# Returns the expected accuracy: 0.95
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=256)
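
The comparison above can be automated with a small sweep over evaluation batch sizes. This is only a sketch: `sweep_eval_batch_sizes` and `suspicious_sizes` are hypothetical helpers, the 0.05 tolerance is an arbitrary choice, and the stand-in evaluator uses placeholder numbers loosely based on the figures above (the low accuracies were only reported as "< 0.2"), not new measurements.

```python
def sweep_eval_batch_sizes(evaluate_fn, batch_sizes):
    """Call evaluate_fn(batch_size) for each size and collect the accuracies."""
    results = {}
    for bs in batch_sizes:
        results[bs] = evaluate_fn(bs)
    return results


def suspicious_sizes(results, tolerance=0.05):
    """Batch sizes whose accuracy is more than `tolerance` below the best one."""
    best = max(results.values())
    return [bs for bs, acc in results.items() if best - acc > tolerance]


# Stand-in evaluator with placeholder accuracies; replace it with a real call.
placeholder = {32: 0.2, 64: 0.2, 128: 0.93, 256: 0.95}
results = sweep_eval_batch_sizes(lambda bs: placeholder[bs], [32, 64, 128, 256])
print(suspicious_sizes(results))
```

With the model from the first post, the evaluator could be `lambda bs: model.evaluate(test_images, test_labels, batch_size=bs, verbose=0)[1]`, since `evaluate` returns the loss followed by the compiled metrics.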

apple / tensorflow_macos

On Apple Silicon M1, unexpected very low test accuracy on LSTM #55