Closed: darraghdog closed this issue 3 years ago
It never happened to me, @sharonFogel, how about you?
Sorry for the trivial question, but what is the value of --num_epochs?
This would be an example of my settings:
python train.py --name_prefix demov14 --lex Datasets/Lexicon/english_lines.txt --continue_train --num_epochs 100 --dataname IAMlinescharH32W16rmPunct --capitalize --no_html --gpu_ids 0 --batch_size 32
I have created a new lexicon to train on lines, so lines also go through the generator. The longest line would be around 50 characters. My memory usage from nvidia-smi on the GPU is 19814MiB / 40537MiB; however, at times, particularly at model initialisation, memory can go up to ~40GB.
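As a quick sanity check on the new lexicon, a scan along these lines can flag characters outside the alphabet the model expects. The alphabet below is a hypothetical placeholder; substitute the character set your own config actually uses.

```python
# Quick sanity check: flag lexicon lines containing characters outside the
# model's alphabet. ALPHABET is a hypothetical placeholder; substitute the
# character set your training run actually uses.
LEXICON_PATH = "Datasets/Lexicon/english_lines.txt"
ALPHABET = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

with open(LEXICON_PATH, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        unexpected = set(line.rstrip("\n")) - ALPHABET
        if unexpected:
            print(f"line {lineno}: unexpected characters {sorted(unexpected)}")
```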
I have also added some new data sources to the dataset. With IAM alone it always runs fine, but the problem appears when I add unsupervised CVL and one other data source.
For now I am just restarting the job manually to reach the ~100 epochs; it resumes from the latest checkpoint, and it typically stalls after about 20 epochs. I will try some more things to narrow down the issue and let you know if I find anything.
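A rough sketch of automating that restart: rerun the training command whenever the process exits or its log goes quiet, relying on --continue_train to resume from the latest checkpoint. The log path here is an assumption about where the training loop writes its progress; point it at whichever file stops updating during a hang.

```python
# Sketch of automating the manual restart described above. Kills the run if
# the log file is silent for too long, then restarts it; --continue_train
# resumes from the latest checkpoint. LOG_PATH is an assumed location.
import os
import subprocess
import time

CMD = [
    "python", "train.py",
    "--name_prefix", "demov14",
    "--lex", "Datasets/Lexicon/english_lines.txt",
    "--continue_train", "--num_epochs", "100",
    "--dataname", "IAMlinescharH32W16rmPunct",
    "--capitalize", "--no_html",
    "--gpu_ids", "0", "--batch_size", "32",
]
LOG_PATH = "checkpoints/demov14/loss_log.txt"  # assumed log location
STALL_SECONDS = 30 * 60  # treat half an hour of log silence as a hang

while True:
    proc = subprocess.Popen(CMD)
    while proc.poll() is None:  # still running
        time.sleep(60)
        try:
            silent_for = time.time() - os.path.getmtime(LOG_PATH)
        except OSError:
            continue  # log not created yet
        if silent_for > STALL_SECONDS:
            proc.kill()  # stalled: kill it and resume from the checkpoint
            proc.wait()
            break
    if proc.returncode == 0:
        break  # training completed its epochs normally
```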
Closing this issue. Some changes to the data used resolved the problems, so I am pretty sure it is data related.
Hi, thank you for the excellent work in this repo.

When training, I am finding that in some configurations the model stops training: the run stays in GPU memory with no error, but the updates to the logs stop. This happened, for example, when I ran with the unsupervised CVL dataset: after about 10 epochs the training stopped, with no error, no further log updates, and no new checkpoints written. I am wondering if you have experienced this. It may be that I have an incorrect character in the lexicon, or something similar. Sorry about the vague details.

Best, Darragh.

P.S. Feel free to close the issue if you did not have this case.
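One way to see where the process is stuck when the logs go quiet: the standard-library faulthandler can periodically dump every thread's stack trace to a file, which should show whether the run is waiting in a data loader, a GPU call, or elsewhere. A minimal sketch, to be placed near the top of train.py (the output path is an arbitrary choice):

```python
# Periodically dump every thread's stack trace so a silent hang leaves a
# record of where the process was stuck. Uses only the standard library.
import faulthandler

trace_log = open("hang_tracebacks.log", "w")
# Every 30 minutes, append all thread stack traces; repeats until cancelled.
faulthandler.dump_traceback_later(timeout=30 * 60, repeat=True, file=trace_log)
```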