clovaai / stargan-v2

StarGAN v2 - Official PyTorch Implementation (CVPR 2020)

Code crashes without error!! #67

Open pkhigh opened 4 years ago

pkhigh commented 4 years ago

Hi, I was trying to train a model on the CelebA-HQ dataset on a cluster with 8 GPUs, although I am currently using only a single GPU.

OUTPUT LOG:
Namespace(batch_size=4, beta1=0.0, beta2=0.99, checkpoint_dir='expr/checkpoints', ds_iter=100000, eval_dir='expr/eval', eval_every=50000, f_lr=1e-06, hidden_dim=512, img_size=256, inp_dir='assets/representative/custom/female', lambda_cyc=1.0, lambda_ds=1.0, lambda_reg=1.0, lambda_sty=1.0, latent_dim=16, lm_path='expr/checkpoints/celeba_lm_mean.npz', lr=0.0001, mode='train', num_domains=2, num_outs_per_domain=10, num_workers=2, out_dir='assets/representative/celeba_hq/src/female', print_every=10, randcrop_prob=0.5, ref_dir='assets/representative/celeba_hq/ref', result_dir='expr/results', resume_iter=0, sample_dir='expr/samples', sample_every=5000, save_every=10000, seed=777, src_dir='assets/representative/celeba_hq/src', style_dim=64, total_iters=100000, train_img_dir='data/celeba_hq/train', val_batch_size=32, val_img_dir='data/celeba_hq/val', w_hpf=1.0, weight_decay=0.0001, wing_path='expr/checkpoints/wing.ckpt')
Number of parameters of generator: 43467395
Number of parameters of mapping_network: 2438272
Number of parameters of style_encoder: 20916928
Number of parameters of discriminator: 20852290
Number of parameters of fan: 6333603
Initializing generator...
Initializing mapping_network...
Initializing style_encoder...
Initializing discriminator...
Preparing DataLoader to fetch source images during the training phase...
Preparing DataLoader to fetch reference images during the training phase...
Preparing DataLoader for the generation phase...
Start training...

The process gets killed right after "Start training..." without any error message or traceback. However, when I set num_workers=0, the code runs properly, so I believe the issue lies in the DataLoader worker processes. Can you suggest where to look when debugging this?
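
A process that dies with no Python traceback has often been terminated by a signal rather than by an exception, for example by the Linux out-of-memory killer when worker processes push memory use too high. That is an assumption here, not something the log above confirms. A minimal sketch for checking it is to launch the training command through `subprocess` and decode the return code. The helper names `describe_exit` and `run_and_report` are hypothetical and not part of StarGAN v2.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable explanation."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number unknown on this platform; report the raw number.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run a command (list of arguments), wait for it, and describe how it ended."""
    proc = subprocess.run(cmd)
    return describe_exit(proc.returncode)


# Self-check with a tiny child process that exits with status 3.
print(run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Passing the actual training command as the argument list (for instance `[sys.executable, "main.py", "--mode", "train", ...]` with the remaining arguments as in the Namespace above) would report `killed by SIGKILL` if the kernel terminated the process. In that case the kernel log (`dmesg`) is a reasonable next place to look.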

usingnamespacestc commented 3 years ago

Months have passed, so I don't know whether you have already figured this out. Which system are you using? I have met a similar problem on my Windows laptop, although the code runs well on Colab. I think this happens when your operating system handles multiprocessing differently.
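
To make the platform difference concrete: Python's `multiprocessing` starts workers with `spawn` on Windows, while Linux has traditionally defaulted to `fork` (newer Python versions may pick `forkserver`). The sketch below only reports what the current interpreter uses. `start_method_report` is a hypothetical helper name, and whether this difference explains the crash above is not established in this thread.

```python
import multiprocessing
import platform


def start_method_report():
    """Collect the multiprocessing start-method settings of this interpreter."""
    return {
        "platform": platform.system(),
        # Note: get_start_method() also fixes the default as the global method.
        "default": multiprocessing.get_start_method(),
        "available": multiprocessing.get_all_start_methods(),
    }


report = start_method_report()
print(report)
```

Running this on both machines (for example the Windows laptop and Colab) shows whether the two environments differ. Under `spawn`, each worker re-imports the main script, so code that creates worker processes needs to sit behind an `if __name__ == "__main__":` guard.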