Training failed - GitHub Issues

gliangMT commented 8 months ago

Hi, my development environment is an RTX 3070 Ti running the TensorFlow Docker container image. Here are my results from training the model:

Epoch 1/30
2024-02-28 10:46:46.035565: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
3003/3004 [============================>.] - ETA: 0s - loss: 3.4103 - acc: 0.0341
Epoch 1: val_loss improved from inf to 3.41917, saving model to ./social_weavers_model.h5
/usr/local/lib/python3.11/dist-packages/keras/src/engine/training.py:3103: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
  saving_api.save_model(
3004/3004 [==============================] - 138s 46ms/step - loss: 3.4103 - acc: 0.0340 - val_loss: 3.4192 - val_acc: 0.0337
Epoch 2/30
3003/3004 [============================>.] - ETA: 0s - loss: 3.3761 - acc: 0.0359
Epoch 2: val_loss did not improve from 3.41917
3004/3004 [==============================] - 137s 46ms/step - loss: 3.3761 - acc: 0.0360 - val_loss: 3.4291 - val_acc: 0.0333
Epoch 3/30
3004/3004 [==============================] - ETA: 0s - loss: 3.3754 - acc: 0.0365
Epoch 3: val_loss did not improve from 3.41917
3004/3004 [==============================] - 137s 46ms/step - loss: 3.3754 - acc: 0.0365 - val_loss: 3.4315 - val_acc: 0.0333
Epoch 4/30
3004/3004 [==============================] - ETA: 0s - loss: 3.3753 - acc: 0.0347
Epoch 4: val_loss did not improve from 3.41917
Restoring model weights from the end of the best epoch: 1.
3004/3004 [==============================] - 137s 46ms/step - loss: 3.3753 - acc: 0.0347 - val_loss: 3.4325 - val_acc: 0.0333
Epoch 4: early stopping

As the log shows, val_loss does not decrease at all after epoch 1, and val_acc stays at about 0.0333, which is exactly the probability of picking one individual from the population at random. Why does this happen?
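
For reference, here is a small sketch of the chance-level numbers for comparison with the log above. It assumes 30 individuals (inferred only from val_acc ≈ 0.0333 ≈ 1/30; the class count is not stated in this thread), roughly balanced classes, and a categorical cross-entropy loss using the natural logarithm. The helper name `chance_level` is made up for illustration.

```python
import math


def chance_level(num_classes: int) -> tuple[float, float]:
    """Accuracy and cross-entropy loss of a uniform random guesser.

    Assumes balanced classes and natural-log cross-entropy.
    """
    accuracy = 1.0 / num_classes
    loss = math.log(num_classes)  # equals -ln(1 / num_classes)
    return accuracy, loss


# ASSUMPTION: 30 classes, guessed from val_acc ~= 1/30 in the log above.
acc, loss = chance_level(30)
print(f"chance accuracy: {acc:.4f}")  # 0.0333
print(f"chance loss:     {loss:.4f}")  # 3.4012
```

Under these assumptions, the logged val_acc (0.0333) and val_loss (about 3.42 to 3.43) both sit very close to the chance values, which would be consistent with the network producing nearly uniform predictions rather than learning. This is only a sanity check, not a diagnosis of the cause.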

gliangMT commented 8 months ago

And here is my training history:

gliangMT commented 8 months ago

The following picture is from another task that I ran with the pretrained VGG19 model in the same environment; the result looks correct to me.

AndreCFerreira / Bird_individualID

Training failed #3