SteveOv / ebop_maven

EBOP Model Automatic input Value Estimation Neural network
GNU General Public License v3.0

Investigate "UserWarning: Your input ran out of data; interrupting training." when training CNN #4

Closed: SteveOv closed this 6 months ago

SteveOv commented 6 months ago

I see the following text whenever training the CNN: UserWarning: Your input ran out of data; interrupting training.

Training continues and I get a usable model. Putting a repeat() in, as advised, just means that epoch 1 never seems to end.

I'm training the model on 80,000 training and 10,000 validation instances, with a further 10,000 instances held back for testing.
Epoch 1/10
   1000/Unknown 12s 9ms/step - loss: 0.1261 - mse: 0.0508/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 13s 10ms/step - loss: 0.1261 - mse: 0.0508 - val_loss: 0.0555 - val_mse: 0.0157
Epoch 2/10

It only seems to be a problem on the first epoch.
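For context, the training call is essentially the standard tf.data plus model.fit() pattern. The sketch below uses synthetic data and placeholder shapes rather than the real ebop_maven pipeline, but it shows why a bare repeat() makes epoch 1 open-ended: the repeated dataset is infinite, so Keras has no way to know where an epoch finishes unless steps_per_epoch is supplied.

```python
import numpy as np
import tensorflow as tf

# Illustrative sketch only: synthetic data and placeholder shapes standing in
# for the real ebop_maven dataset and model.
BATCH_SIZE = 80
TRAIN_INSTANCES = 80_000              # => 1000 batches per full pass

x = np.random.rand(TRAIN_INSTANCES, 200, 1).astype("float32")
y = np.random.rand(TRAIN_INSTANCES, 7).astype("float32")
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(BATCH_SIZE)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200, 1)),
    tf.keras.layers.Conv1D(16, 8, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(7),
])
model.compile(optimizer="adam", loss="mae", metrics=["mse"])

# Finite dataset: the UserWarning fires when the iterator is exhausted.
model.fit(train_ds, epochs=10)

# With a bare repeat() and no steps_per_epoch the dataset is infinite, so
# Keras cannot tell where an epoch ends - hence epoch 1 never finishing.
# model.fit(train_ds.repeat(), epochs=10)
```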

SteveOv commented 6 months ago

This code is largely unchanged from the POC, which ran on TF 2.6 without error.

SteveOv commented 6 months ago

I cannot stop this warning.

An internet search throws up issues people have had over the years with steps_per_epoch, batch_size and how they interrelate. That's not the problem here: I have 80,000 instances and a batch size of 80, so I'm expecting 1000 steps per epoch. I can pass these values in as arguments and the warning still occurs. If I reduce steps_per_epoch without changing the other values, effectively stopping each epoch early, I still get the warning a couple of steps before the end of the epoch. The same goes for the validation_data.

Changing verbose seems to alter when the warning happens.

Adding repeat(2) to the dataset pipelines doesn't stop the warning either; it just happens later (near the end of epoch 1 when verbose == 1).
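For the record, this is roughly what I mean (continuing the illustrative sketch in the first comment; the exact code differs):

```python
# Continuing the sketch above: add a validation split and pass the step
# counts explicitly.
VAL_INSTANCES = 10_000
steps_per_epoch = TRAIN_INSTANCES // BATCH_SIZE      # 80,000 / 80 = 1000
validation_steps = VAL_INSTANCES // BATCH_SIZE       # 10,000 / 80 = 125

xv = np.random.rand(VAL_INSTANCES, 200, 1).astype("float32")
yv = np.random.rand(VAL_INSTANCES, 7).astype("float32")
val_ds = tf.data.Dataset.from_tensor_slices((xv, yv)).batch(BATCH_SIZE)

# Passing the counts explicitly doesn't silence the warning, and neither does
# repeating the pipelines a fixed number of times (e.g. .repeat(2)); the
# warning just fires later in epoch 1.
model.fit(train_ds, epochs=10,
          steps_per_epoch=steps_per_epoch,
          validation_data=val_ds,
          validation_steps=validation_steps)
```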

SteveOv commented 6 months ago

My gut feeling is that this is either a change in TensorFlow's behaviour since the POC or something specific to this environment, such as training on the CPU rather than the GPU.

I'm going to see what happens with the GPU enabled.

SteveOv commented 6 months ago

Interesting. In my new conda env it doesn't see the GPU.

And pip install tensorflow-gpu falls over. FFS, I can't believe a modern language and environment are so sensitive to versioning. I thought we'd got over this shite in the 2000s.

SteveOv commented 6 months ago

OK, tensorflow-gpu is deprecated; GPU support should now be part of the full tensorflow install.
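For anyone else landing here: the standalone tensorflow-gpu package is gone, and GPU visibility can be checked directly. The [and-cuda] extra below is what I believe the current Linux install docs suggest; check them for the version in use.

```python
import tensorflow as tf

# Quick check that TensorFlow can actually see a GPU in the active env.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))   # [] means CPU-only

# For recent TF releases on Linux the CUDA libraries come via an optional
# extra rather than a separate tensorflow-gpu package, e.g.
#   pip install "tensorflow[and-cuda]"
# (verify against the install docs for your version).
```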

SteveOv commented 6 months ago

Having worked through #10, I've finally been able to confirm it's still an issue when training on the GPU.

I have noticed that there are more diagnostic messages under a venv (I'm probably suppressing them with TF_CPP_MIN_LOG_LEVEL in the conda env). I will look into this.
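For reference, TF_CPP_MIN_LOG_LEVEL controls the C++-side logging and has to be set before tensorflow is imported; something along these lines:

```python
import os

# Must be set before tensorflow is imported to take effect.
# "0" = all messages, "1" = filter INFO, "2" = also filter WARNING,
# "3" = also filter ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf  # noqa: E402  (deliberately imported after the env var)
```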

SteveOv commented 6 months ago

Behaviour under the venv displays the following for every training epoch:

Epoch 1/100
    997/Unknown 12s 10ms/step - loss: 0.1381 - mse: 0.05692024-04-08 09:36:39.542327: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
/home/steveo/anaconda3/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(typ, value, traceback)
2024-04-08 09:36:40.536080: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 13s 11ms/step - loss: 0.1379 - mse: 0.0568 - val_loss: 0.0567 - val_mse: 0.0167

It doesn't seem to stop training from proceeding.

I suspect the conda env produces the same messages but they're suppressed after the first epoch ... confirmed.
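If the Python-side UserWarning gets too noisy it can presumably be filtered with the standard warnings module; note this won't touch the C++ local_rendezvous messages, which are governed by TF_CPP_MIN_LOG_LEVEL.

```python
import warnings

# Suppress just this Keras UserWarning; the message argument is a regex
# matched against the start of the warning text.
warnings.filterwarnings(
    "ignore",
    message="Your input ran out of data",
    category=UserWarning,
)
```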

SteveOv commented 6 months ago

That helps. Looks like this is another issue with TensorFlow 2.16. See

It's not causing any failures and hopefully it will be fixed in a future version. I'll update my docs and close this with no action required.