Closed: SteveOv closed this 6 months ago
This code is largely unchanged from the POC, which ran on tf 2.6 without error.
I cannot stop this warning.
Internet searches turn up issues people have had over the years with steps_per_epoch, batch_size and how they interact. That's not the problem here; I have 80,000 instances and a batch size of 80, so I'm expecting 1000 steps_per_epoch. I can pass these values in as arguments and the warning still occurs. If I reduce steps_per_epoch without changing the other values, effectively stopping each epoch early, I still get the warning a couple of steps before the end of the epoch. The same goes for validation_data.
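For reference, a minimal standard-library sketch of the arithmetic involved (using the numbers above; no TensorFlow needed). A finite dataset yields exactly num_instances // batch_size batches and then stops, which is what TF surfaces as the OUT_OF_RANGE "End of sequence" status:

```python
# Pure-Python sketch: a finite batched dataset runs out exactly at the
# epoch boundary when steps_per_epoch == num_instances // batch_size.
num_instances = 80_000   # figures from the description above
batch_size = 80

steps_per_epoch = num_instances // batch_size  # 1000 full batches

def batches(n_instances, batch):
    """Yield one dummy batch per `batch` instances, then stop (StopIteration)."""
    for start in range(0, n_instances, batch):
        yield list(range(start, min(start + batch, n_instances)))

steps_taken = sum(1 for _ in batches(num_instances, batch_size))
print(steps_per_epoch, steps_taken)  # 1000 1000 -- data is exhausted exactly at the end of the epoch
```

So with these values the iterator is exhausted precisely when the epoch ends, which is why the warning firing a couple of steps early is suspicious.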
Changing verbose seems to alter when the warning happens.
Adding repeat(2) to the dataset pipelines doesn't stop the warning. It just happens later (near the end of epoch 1 if verbose == 1).
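As I understand it, a finite repeat just concatenates the data with itself, so exhaustion still happens, only later. A standard-library analogy (itertools.chain standing in for Dataset.repeat(2)):

```python
# Sketch: repeat(2) doubles the data but the iterator still ends,
# so the "ran out of data" condition is delayed, not removed.
import itertools

data = list(range(10))                          # stand-in for a 10-batch dataset
doubled = itertools.chain(data, data)           # analogous to dataset.repeat(2)

consumed = sum(1 for _ in doubled)
print(consumed)  # 20 -- twice as many steps, then exhaustion again
```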
My gut feeling is that this is either;
I'm going to see what happens with the GPU enabled.
Interesting. In my new conda env it doesn't see the GPU.
And pip install tensorflow-gpu falls over. FFS I can't believe a modern language and environment is so sensitive to versioning. I thought we'd got over this shite in the 2000s.
OK, tensorflow-gpu deprecated. Should be part of full tensorflow install.
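Per the TensorFlow install docs, the deprecated tensorflow-gpu package is gone and GPU support ships in the main package; on Linux the bundled CUDA libraries come via an optional extra:

```shell
# tensorflow-gpu is deprecated; GPU support is in the main package now.
# On Linux, pull in the CUDA libraries via the optional extra:
pip install 'tensorflow[and-cuda]'
# Quick check that the GPU is actually visible:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```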
Having worked through #10 I've finally been able to confirm it's still any issue when training on the GPU.
I have noticed that there are more diagnostic messages under a venv (I'm probably suppressing them with TF_CPP_MIN_LOG_LEVEL in the conda env). Will look into this.
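For the record, TF_CPP_MIN_LOG_LEVEL has to be set before TensorFlow is imported, because the C++ runtime reads it at import time. A minimal sketch:

```python
# TF_CPP_MIN_LOG_LEVEL: 0 = all messages, 1 = filter INFO,
# 2 = filter INFO + WARNING, 3 = filter everything.
# It must be set *before* `import tensorflow`.
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # hide INFO and WARNING diagnostics
# import tensorflow as tf   # import only after the variable is set
print(os.environ["TF_CPP_MIN_LOG_LEVEL"])
```

That would explain why the conda env (where this is set) shows fewer of the local_rendezvous messages than the venv.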
Behaviour under the venv displays the following for every training epoch:
Epoch 1/100
997/Unknown 12s 10ms/step - loss: 0.1381 - mse: 0.0569
2024-04-08 09:36:39.542327: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node IteratorGetNext}}]]
/home/steveo/anaconda3/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
self.gen.throw(typ, value, traceback)
2024-04-08 09:36:40.536080: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
[[{{node IteratorGetNext}}]]
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 13s 11ms/step - loss: 0.1379 - mse: 0.0568 - val_loss: 0.0567 - val_mse: 0.0167
Doesn't seem to stop training proceeding.
I suspect the conda env has the same messages, but they're suppressed after the first epoch ... confirmed.
That helps. Looks like this is another issue with TensorFlow 2.16. See
It's not causing any failures and hopefully will be fixed in a future version. I'll update my docs and close this as no action required.
See the following text whenever training the CNN: UserWarning: Your input ran out of data; interrupting training.
Training continues and I get a usable model. Putting a repeat() in, as advised, just means that epoch 1 never seems to end.
It only seems to be a problem on the first epoch.
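The never-ending epoch makes sense: an unbounded repeat() makes the dataset infinite, so the consumer can no longer infer where an epoch ends and needs an explicit steps_per_epoch. A standard-library analogy (itertools.cycle standing in for an uncounted dataset.repeat()):

```python
# Sketch: an unbounded repeat never raises StopIteration, so the
# consumer must impose the epoch boundary itself (steps_per_epoch).
import itertools

finite = list(range(10))              # stand-in for a 10-batch dataset
endless = itertools.cycle(finite)     # analogous to dataset.repeat() with no count

steps_per_epoch = 10
epoch = list(itertools.islice(endless, steps_per_epoch))  # bounded consumption
print(len(epoch))  # 10 -- without the bound, iteration would never terminate
```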