festinais closed this issue 2 years ago
Did you use the -metrics prdc option?
It takes a long time (in my case, 6 hours for ImageNet) to calculate PRDC.
So I highly recommend turning on only the IS and FID metrics during training.
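For example, a training command with only the lighter metrics enabled could look like the sketch below (assuming -metrics accepts multiple metric names, as the "IS and FID" suggestion implies; config and data paths are the ones used later in this thread):

```shell
# Sketch: compute only IS and FID during training; the heavy PRDC
# computation is skipped entirely.
python3 src/main.py -t -metrics is fid \
    -cfg src/configs/CIFAR10/ReACGAN.yaml \
    -data data/CUSTOM -save output/
```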
Thank you.
Best,
Minguk
Thank you for your reply! I was using the prdc metric as well; I'm trying now without it and will see.
There is another thing I'm not sure is correct: when loading the train and validation data, the logs say that the eval dataset has the same length as the train dataset.
The logs: Load CUSTOM train dataset. Train dataset size: 36808
Then for validation (it takes the same path as the training set): Load CUSTOM train dataset. Eval dataset size: 36808
The way I'm training is: "python3 src/main.py -t -metrics fid -cfg src/configs/CIFAR10/ReACGAN.yaml -data data/CUSTOM -save output/"
I added the folders data/CUSTOM/train and data/CUSTOM/valid, as described in the documentation.
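For clarity, the layout follows the class-subfolder structure from the documentation (the class folder names below are illustrative):

```
data/CUSTOM/
├── train/
│   ├── class0/   # one subfolder per class, containing its images
│   └── class1/
└── valid/
    ├── class0/
    └── class1/
```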
Any ideas are highly appreciated! Thank you
If you are using a custom training dataset that doesn't contain an extra valid/test split, StudioGAN will calculate the metric (FID) using your train dataset. So yes, the log printing the same value for the train/eval datasets is correct behaviour.
But is your custom dataset a CIFAR subset? If you use the configs in src/configs/CIFAR10/, images will be resized to 32x32 resolution and you'll need 10 classes.
It seems you'll need to build your own config for your custom dataset in this case.
Thanks
Hi, thank you for your feedback! I changed only the DATA part inside the yaml file. Also, for MODEL I changed only z_dim. I guess the other parts should remain as-is, right?
As for the interrupted training, this is still happening.
It always stops at this step: "generate images and stack features (36808 images)". However, I noticed that it sometimes stops at a different progress step; for example, now it stopped at step 4000, but sometimes it goes past 4000 (so it's not always at the same point). I'm not sure how to interpret this.
This is my command: CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -metrics fid -cfg src/configs/CIFAR10/ReACGAN.yaml -data data/CUSTOM -save output/ >> logs.txt &
I assume my dataset is not that large in the end. Number of samples are: 36808, with 7 classes.
Any ideas are highly appreciated!
That procedure is performed by the "generate_images_and_stack_features" function in ./src/metrics/features.py, so could you find the exact point where the training stops?
You can use the pdb debugger to identify this.
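If pdb is inconvenient because the process dies or hangs silently, another option (a generic Python sketch, not StudioGAN-specific code) is the standard-library faulthandler module, which prints a traceback on hard crashes and can periodically dump where every thread is stuck:

```python
import faulthandler

# Print a Python traceback if the interpreter crashes hard
# (e.g. a segfault inside a C/CUDA extension).
faulthandler.enable()

# Every 10 minutes, dump every thread's stack to stderr; if the run
# hangs, the last dump shows exactly where it is stuck.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
```

Adding these two calls near the top of src/main.py would help narrow down whether the stop happens inside generate_images_and_stack_features.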
Thank you.
Thank you for your feedback! I'm now running the training not as a background process, and it hasn't stopped yet. However, the training is very slow: 20% after 9 hours on 4 GPUs. I'm assuming it's because the training data is being used for evaluation?
Do you have any tips on test data? I'm using my own custom dataset (medical data). Is it recommended to have a test folder inside the validation folder? I didn't see it in the documentation.
Also, is there any way to resume training from the last checkpoint? Or should I include this functionality myself?
Thank you! I appreciate any feedback!
However, the training is very slow: 20% after 9 hours on 4 GPUs. I'm assuming it's because the training data is being used for evaluation?
=> I am not sure whether using the training dataset for evaluation causes the slow training. If you use the -ref "valid" or -ref "test" option, the metrics will be measured using the "valid" or "test" dataset instead of the "train" dataset.
Do you have any tips on test data? I'm using my own custom dataset (medical data). Is it recommended to have a test folder inside validation folder? I didn't see it in the documentation.
=> If you follow the data structure described in the README, you can change the dataset used for evaluation with the -ref "train", -ref "valid", or -ref "test" option.
Also, is there any way to resume training from the last checkpoint? Or should I include this functionality myself?
=> Yes, you can resume your training using the -ckpt CHECKPOINT_PATH option. Please refer to the README file.
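Putting the -ckpt and -ref answers together, a resumed run could look like the following sketch (CHECKPOINT_PATH is a placeholder; the other flags mirror the commands used earlier in this thread):

```shell
# Sketch: resume training from a saved checkpoint directory and
# evaluate on the valid split instead of the training data.
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -metrics fid \
    -cfg src/configs/CIFAR10/ReACGAN.yaml \
    -data data/CUSTOM -save output/ \
    -ckpt CHECKPOINT_PATH \
    -ref "valid"
```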
Best,
Minguk
Thanks a lot for the quick feedback! Do you recommend using -ref "valid" or -ref "train"?
Yes, I recommend using -ref "valid" during training.
After training is done, you can evaluate your trained model with the -ref "test" option.
Thanks.
Thank you! I also need to ask about the image generation process after training is finished. Which command should I use to only generate images, without training? How do I specify the seeds for the generator? I want to generate, let's say, 8000 samples for class 2.
How do I specify the seeds for the generator? ==> --seed SEED_NUMBER
I want to generate let's say 8000 samples for class 2. ==> You can use the command below to generate fake images from a pre-trained generator.
CUDA_VISIBLE_DEVICES=0 python3 src/main.py blablabla -ckpt CKPT_PATH_FOR_GAN -sf --seed 1234 -sf_num NUMBER_OF_TOTAL_IMAGE
, where NUMBER_OF_TOTAL_IMAGE is 8000*num_classes.
The above command will generate approximately 8000 images per class.
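As a quick sanity check (assuming the 7 classes mentioned earlier in this thread), the value to pass as -sf_num works out as:

```python
# -sf_num is the total number of generated images across all classes,
# so for roughly 8000 samples per class:
samples_per_class = 8000
num_classes = 7  # assumption: the custom dataset in this thread has 7 classes
sf_num = samples_per_class * num_classes
print(sf_num)  # 56000
```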
Thank you.
I pulled the latest changes from master (in order to generate images), and while training with the same command as before (which used to work):
CUDA_VISIBLE_DEVICES=1,2,3,4 python3 src/main.py -t -metrics fid -ckpt=output/checkpoints/CUSTOM-ReACGAN-train-2022_03_29_19_15_17/ -ref "valid" -cfg src/configs/CIFAR10/ReACGAN.yaml -data data/CUSTOM -save output/
the -ref "valid" argument is no longer recognized.
Do i have to make any changes?
Thank you!
Also, is there a way to generate data with different checkpoints from different steps? Thank you
Also, just to add some comments on generating data.
This is the command I used to generate data with the last updates from master:
python3 src/main.py -cfg src/configs/CIFAR10/ReACGAN.yaml -save results/ -ckpt=output/checkpoints/CUSTOM-ReACGAN-train-2022_03_29_19_15_17/ -sf --seed 1234 -sf_num 70 -data data/CUSTOM -ref "valid"
I had to add -ref "valid" and also -data; I'm not sure why these are needed. Also, -sf_num doesn't distribute the generated samples evenly across classes, e.g. 10 per class (in my case I have 7 classes).
Also, is there a way to generate data with different checkpoints from different steps? => No, StudioGAN does not support this function.
the -ref "valid" argument is not recognized anymore => please change self.num_eval[self.RUN.ref_dataset] to self.num_eval["test"].
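As a self-contained illustration of that workaround (the dictionary contents below are made up; in StudioGAN the real lookup happens in the evaluation code):

```python
# The failing lookup, schematically:
#   num_images = self.num_eval[self.RUN.ref_dataset]   # KeyError: 'valid'
# Workaround: fall back to the "test" entry when the requested key is missing.
num_eval = {"train": 36808, "test": 36808}  # made-up example values
ref_dataset = "valid"
num_images = num_eval.get(ref_dataset, num_eval["test"])
print(num_images)  # 36808
```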
Best,
Minguk
I'm still having the problem with this!
Any help is appreciated!
Please print(self.num_eval) and input an appropriate key for evaluation.
Thank you.
Training always stops after this step:
Visualize (num_rows x 8) fake image canvas.
Save image canvas to output/figures/CUSTOM-ReACGAN-train-2022_03_23_14_40_18/generated_canvas_4000.png
Start Evaluation (4000 Step): CUSTOM-ReACGAN-train-2022_03_23_14_40_18
generate images and stack features (36808 images).
I'm training ReACGAN on my own dataset. However, the training always stops after this step, and there are no logs explaining why!
Any help or idea would be highly appreciated!