amzn / convolutional-handwriting-gan

ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation (CVPR20)
https://www.amazon.science/publications/scrabblegan-semi-supervised-varying-length-handwritten-text-generation
MIT License

Model training stops #11

Closed darraghdog closed 3 years ago

darraghdog commented 3 years ago

Hi, thank you for the excellent work in this repo. In some configurations the model stops training: the run stays in GPU memory with no error, but the logs stop updating. For example, when I ran with the unsupervised CVL dataset, training stopped after about 10 epochs: no error, the logs just stopped being updated and no new checkpoints were written. I am wondering if you have experienced this. It may be that I have an incorrect character in the lexicon, or something else. Sorry about the vague details. Best, Darragh. P.S. Feel free to close the issue if you have not seen this.
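One way to test the "incorrect character in the lexicon" suspicion is to scan the lexicon for characters outside the alphabet the model was configured with. The following is a minimal sketch, not code from this repo: the function name `find_out_of_alphabet` and the `alphabet` argument are hypothetical, and the alphabet you pass in must be whatever character set your own training setup actually uses.

```python
def find_out_of_alphabet(lines, alphabet):
    """Return (line_number, entry, bad_chars) for every entry that
    contains at least one character not present in `alphabet`.

    `lines` is any iterable of strings (e.g. an open text file).
    `alphabet` is a string of allowed characters (an assumption:
    supply the character set your training configuration expects).
    """
    allowed = set(alphabet)
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")  # drop the line ending only
        bad = sorted(set(entry) - allowed)
        if bad:
            problems.append((lineno, entry, bad))
    return problems
```

To check a file, open it with an explicit encoding, for example `with open("Datasets/Lexicon/english_lines.txt", encoding="utf-8") as f: print(find_out_of_alphabet(f, my_alphabet))`, where `my_alphabet` is a placeholder for your own character set. An empty result only rules out unexpected characters; it does not rule out other data problems.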

rlit commented 3 years ago

It never happened to me. @sharonFogel, how about you?

Sorry for the trivial question, but what is the value of --num_epochs?

darraghdog commented 3 years ago

This is an example of my settings:

python train.py --name_prefix demov14 --lex Datasets/Lexicon/english_lines.txt --continue_train --num_epochs 100 \
    --dataname IAMlinescharH32W16rmPunct --capitalize --no_html --gpu_ids 0 --batch_size 32

I have created a new lexicon to train on lines, so lines also go through the generator. The longest line is around 50 characters. GPU memory usage reported by nvidia-smi is 19814MiB / 40537MiB; however, at times, particularly during model initialisation, usage can go up to ~40GB.

I have also added some new data types to the dataset. With only IAM it always runs OK, but the problem happens when I add CVL unsupervised and one other data source.

For now I am just restarting the job manually to reach the ~100 epochs. It resumes from the latest checkpoint and typically stops again after about 20 epochs. I will try some more things to narrow down the issue and let you know if I find anything.
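The manual restart described above could be automated with a small watchdog that treats a log file that has stopped updating as a stall. This is only a sketch under stated assumptions: `is_stalled` and `run_with_restart` are hypothetical helper names, the log path is whatever file your run writes to, and the timeout must be longer than the normal gap between log writes. It uses only the Python standard library and does not diagnose why the run hangs.

```python
import os
import subprocess
import time


def is_stalled(log_path, timeout_s, now=None):
    """True if `log_path` exists and was last modified more than
    `timeout_s` seconds before `now` (defaults to the current time)."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # No log file yet: not treated as a stall in this sketch.
        return False
    return (now - last_write) > timeout_s


def run_with_restart(cmd, log_path, timeout_s=1800, poll_s=60, max_restarts=10):
    """Run `cmd`; if the log stops updating, kill the process and relaunch it.

    Returns the number of restarts once the process exits on its own.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        started = time.time()
        while proc.poll() is None:
            time.sleep(poll_s)
            # Grace period: ignore a stale log left over from the previous launch.
            past_grace = (time.time() - started) > timeout_s
            if past_grace and is_stalled(log_path, timeout_s):
                proc.kill()
                proc.wait()
                break
        else:
            # Loop ended without break: the process finished by itself.
            return restarts
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("too many restarts; giving up")
```

Here `cmd` would be the training command as a list of arguments, including `--continue_train` so that each relaunch resumes from the latest checkpoint as described above. Killing a process mid-write can leave a partial checkpoint, so this is a stopgap while the underlying cause is investigated.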

darraghdog commented 3 years ago

Closing this issue. Some changes to the data I used solved some of the problems, so I am pretty sure it is data related.