Hi Stefan, it's really hard to guess what's wrong here. When a GPU runs out of memory, it usually complains about it with an error message; I have only seen this when trying to train large models in parallel on smaller GPUs. It could also be that you're out of RAM (which should be visible in a system monitor). That can be avoided by streaming the dataset from disk (--train.preload False). Especially when you're using augmentation, >100k samples can be problematic.
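For example, the flag would simply be appended to your call; this is just a sketch using the image globs from your command, with everything else left as it is:
calamari-train --device.gpus 0 --train.images "train/*.png" --val.images "eval/*.png" --train.preload False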
Much better with --train.preload False.
Now I see this warning (at the beginning of each epoch):
WARNING 2022-08-10 13:27:35,278 tfaip.util.multiprocessing.dat: Invalid data. Skipping for the 1. time.
What does it mean?
(and it looks like it still runs ok...)
btw: I have seen that GPU memory usage is pretty high, so maybe this really was the reason. The host's RAM usage is fine (about 6 GB out of 16 GB).
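For reference, the usual way to keep an eye on this during a run is nothing Calamari-specific, just the standard tools:
watch -n 1 nvidia-smi   (GPU memory and utilization, refreshed every second)
free -h                 (host RAM in human-readable units)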
"Invalid data" could be an image file that is broken/empty/too short for the given text. If the counter does not grow > 100, I would simply ignore that. I don't know if the output of nvidia_smi is helpful here: as far as I remember, tensorflow grabs and reserves what it can get hold of.
"Ignoring" is one the the best things I can do ;-)
Apart from that, I will close this issue now, as it runs ok and it does not look like we will really find the root cause ...
I have tried to use calamari-train like this:
calamari-train --device.gpus 0 --network deep3 --trainer.epochs 2 --train.batch_size 64 --trainer.output_dir model-01/ --early_stopping.n_to_go 0 --train.images "train/*.png" --val.images "eval/*.png"
During data preparation I got the exception below. Hint: my training set is pretty big:
My machine has 16 GB of CPU RAM and a GeForce RTX 3050 GPU with 8 GB of memory.
Any idea?