Open danbricedatascience opened 3 years ago
Just a guess, but could this be because the evaluation is being performed on the M1's Neural Engine, which is likely fp16, whereas training is most likely fp32 on the GPU? Perhaps LSTM nets are more sensitive to this change than Conv or MLP nets?
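To make the precision hypothesis concrete, here is a minimal sketch (standard library only) that emulates accumulating many small increments in fp16 versus fp32. It says nothing about what the M1 actually does: the helper names `round_to_format` and `accumulate` are hypothetical, and each addition is done in double precision and then rounded to the target format, which is only an approximation of real half-precision hardware.

```python
import struct


def round_to_format(value: float, fmt: str) -> float:
    """Round a Python float to a narrower IEEE 754 format.

    fmt is a struct format character: "e" for fp16, "f" for fp32.
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(step: float, count: int, fmt: str) -> float:
    """Add `step` to a running total `count` times, rounding after every add.

    This loosely mimics a recurrent state that is updated at each time step
    and stored in reduced precision (an illustrative assumption).
    """
    total = 0.0
    for _ in range(count):
        total = round_to_format(total + step, fmt)
    return total


# The exact answer is 0.001 * 10_000 = 10.0.
fp16_total = accumulate(0.001, 10_000, "e")
fp32_total = accumulate(0.001, 10_000, "f")

print("fp16 running sum:", fp16_total)
print("fp32 running sum:", fp32_total)
```

In fp16 the running sum stops growing once the increment falls below half the spacing between representable values, while fp32 stays close to 10. Repeated state updates of this kind are one reason a recurrent model might be more exposed to reduced precision than a feed-forward one, if the hypothesis holds.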
@ac-93 Good guess, as it would also explain why the CPU is underutilized even though the overall performance is really good: only 30% slower than a Tesla T4 GPU and 4 times faster than an 8-core Xeon Platinum instance.
When I select the GPU, I clearly get full usage of it, but what happens when I select the CPU is unclear: it is underutilized while being super fast. Unless the CPU monitoring itself is at fault, the only explanation is that it silently uses a dedicated component of the SoC, such as the two ML accelerators or even the Neural Engine.
Maybe Apple specialists can answer this?
Same problem. I was training a CNN model, and the training accuracy and val_accuracy were both very high (>97%). However, the test accuracy was unexpectedly low (~25%) when I ran model.evaluate(test_data, test_labels).
Then I tried running model.evaluate(train_data, train_labels), but still got the same low accuracy. Clearly the M1 chip has some issue with the model evaluation process.
@danbricedatascience Thank you for reporting this issue! We will investigate and get back to you.
I've found some additional information.
This problem disappears when batch_size in the evaluate() function is set to a value greater than or equal to the one used for training.
Example:
model.fit(train_images, train_labels, epochs=5, batch_size=128)

# batch_size defaults to 32
# returns very low accuracy (< 0.2, while training shows 0.97 on the val set)
test_loss, test_acc = model.evaluate(test_images, test_labels)

# returns very low accuracy (< 0.2, while training shows 0.97 on the val set)
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=64)

# returns the expected accuracy: 0.93
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=128)

# returns the expected accuracy: 0.95
test_loss, test_acc = model.evaluate(test_images, test_labels, batch_size=256)
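Since correctly aggregated accuracy should not depend on the evaluation batch size, one way to confirm the symptom is to compute accuracy from raw predictions at several batch sizes and compare. The following is a minimal, framework-free sketch of that check, not the method used by anyone in this thread. All names (`batched_accuracy`, `accuracy_by_batch_size`, `depends_on_batch_size`, `stable_predict`, `flaky_predict`) are hypothetical, and `flaky_predict` is a toy stand-in that only imitates the reported behaviour.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[int]], List[int]]


def batched_accuracy(predict_fn: Predictor, inputs: Sequence[int],
                     labels: Sequence[int], batch_size: int) -> float:
    """Compute accuracy by predicting one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    if len(inputs) == 0:
        return 0.0
    correct = 0
    for start in range(0, len(inputs), batch_size):
        stop = start + batch_size
        predictions = predict_fn(inputs[start:stop])
        correct += sum(p == y for p, y in zip(predictions, labels[start:stop]))
    return correct / len(inputs)


def accuracy_by_batch_size(predict_fn: Predictor, inputs: Sequence[int],
                           labels: Sequence[int],
                           batch_sizes: Sequence[int]) -> Dict[int, float]:
    """Map each batch size to the accuracy measured with it."""
    return {b: batched_accuracy(predict_fn, inputs, labels, b) for b in batch_sizes}


def depends_on_batch_size(results: Dict[int, float], tolerance: float = 0.01) -> bool:
    """Flag a spread in accuracy larger than `tolerance` across batch sizes."""
    values = list(results.values())
    return max(values) - min(values) > tolerance


# Toy data: the "label" of an integer is its parity.
inputs = list(range(1000))
labels = [x % 2 for x in inputs]


def stable_predict(batch: Sequence[int]) -> List[int]:
    """A well-behaved predictor: same answer regardless of batch size."""
    return [x % 2 for x in batch]


def flaky_predict(batch: Sequence[int]) -> List[int]:
    """Hypothetical stand-in that degrades on batches smaller than 128."""
    if len(batch) < 128:
        return [0 for _ in batch]
    return [x % 2 for x in batch]


batch_sizes = [32, 64, 128, 256]
stable_results = accuracy_by_batch_size(stable_predict, inputs, labels, batch_sizes)
flaky_results = accuracy_by_batch_size(flaky_predict, inputs, labels, batch_sizes)

print("stable:", stable_results, "suspicious:", depends_on_batch_size(stable_results))
print("flaky: ", flaky_results, "suspicious:", depends_on_batch_size(flaky_results))
```

With a Keras model, `predict_fn` could presumably wrap `model.predict` followed by an argmax over the class axis; that wiring is an assumption about the setup and is left out so the sketch runs without TensorFlow.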
Hardware: MacBook Air M1, 8GB / 512GB
When benchmarking the same very simple LSTM model on MNIST, the loss decreases as expected and the training accuracy increases as expected, but the test accuracy is dramatically lower on the Apple Silicon MacBook Air than for the same model running on 4 different Intel CPU and GPU configurations with TF 2.2 or 2.3 (iMac 27, Google Colab with T4 GPU, Xeon 8-core instances, K80 GPU).
What could explain such an extreme overfit on Apple Silicon only?
On the other hand, the same benchmark with the LSTM replaced by a CNN or MLP works perfectly and is consistent with all the other configurations.