DeokyunKim / Progressive-Face-Super-Resolution

Official Pytorch Implementation of Progressive Face Super-Resolution (BMVC 2019 Accepted)

## About train #3

Closed shuangliumax closed 4 years ago

shuangliumax commented 4 years ago

Hello, may I ask how to set the training step? I trained at step 1 every time, and after 15,000 iterations the loss was NaN. How do I set step = 2 and step = 3?

DeokyunKim commented 4 years ago

Hyper-parameters such as the weights of the loss functions, the number of iterations in each step, the batch size, etc. are tuned for our training dataset. I think you should adjust each parameter, especially the weights of the loss functions. If you have more questions, please reply to me here. Thank you!
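As a rough illustration of the kind of adjustment meant here: the pixel term and any auxiliary terms are usually balanced by scalar weights that can be lowered if training diverges to NaN. This is only a sketch with hypothetical weight names, not the repository's actual training code:

    import torch.nn as nn

    # Hypothetical weights; lower the auxiliary weight first if the loss becomes NaN
    W_PIXEL, W_AUX = 1.0, 0.01

    pixel_criterion = nn.MSELoss()  # or nn.L1Loss(), depending on the setup

    def total_loss(sr, hr, aux_term):
        # sr, hr: super-resolved and ground-truth image batches
        # aux_term: an auxiliary loss (e.g. adversarial) computed elsewhere
        return W_PIXEL * pixel_criterion(sr, hr) + W_AUX * aux_term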

shuangliumax commented 4 years ago

@DeokyunKim I tried to change the iteration counts to carry out the different training steps, but this problem occurred during step 2:

    File "/home/img/Desktop/Progressive-Face-Super-Resolution-master/face_alignment/utils.py", line 148, in get_preds_fromhm
        preds[..., 0].apply_(lambda x: (x - 1) % hm.size(3) + 1)
    TypeError: apply_ is only implemented on CPU tensors

I found that the apply_ function only works on CPU tensors, but how can I solve this problem?

DeokyunKim commented 4 years ago

You should replace the corresponding code as below.

    idx += 1
    preds = idx.view(idx.size(0), idx.size(1), 1).repeat(1, 1, 2).float()

    # Original line: apply_ only works on CPU tensors
    # preds[..., 0].apply_(lambda x: (x - 1) % hm.size(3) + 1)

    # Vectorized replacement that also works on CUDA tensors
    preds[..., 0] = (preds[..., 0] - 1) % hm.size(3) + 1
    preds[..., 1].add_(-1).div_(hm.size(2)).floor_().add_(1)
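As a quick sanity check, the vectorized line recovers the same coordinates as the commented-out apply_ version but also runs on CUDA tensors. A minimal, self-contained sketch (the heatmap shapes below are made up for illustration):

    import torch

    # Illustrative shapes: batch of 2, 68 landmark heatmaps of size 64x64
    hm = torch.rand(2, 68, 64, 64)           # move to .cuda() if a GPU is available
    _, idx = torch.max(hm.view(2, 68, -1), 2)
    idx += 1                                 # 1-based flat indices, as in the snippet above
    preds = idx.view(idx.size(0), idx.size(1), 1).repeat(1, 1, 2).float()

    # Recover column (x) and row (y) coordinates without apply_
    preds[..., 0] = (preds[..., 0] - 1) % hm.size(3) + 1
    preds[..., 1].add_(-1).div_(hm.size(2)).floor_().add_(1)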

shuangliumax commented 4 years ago

Hello, I trained this network and ran the test. At step 3, the best PSNR I got was only 10.48. I used the same number of training iterations as you (50K, 50K, 100K), and the learning rate decreased by an order of magnitude. Why is the PSNR so low?
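For reference, PSNR is computed from the mean squared error between the super-resolved and ground-truth images; a minimal sketch for images scaled to [0, 1] (this is the standard formula, not necessarily the exact evaluation code in this repository):

    import torch

    def psnr(sr, hr, max_val=1.0):
        # sr, hr: image tensors with values in [0, max_val]; returns PSNR in dB
        mse = torch.mean((sr - hr) ** 2)
        return 10 * torch.log10(max_val ** 2 / mse)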

DeokyunKim commented 4 years ago

The pixel loss function was changed from L1 loss to MSE loss. Also, we are retraining the network using the data augmentation I uploaded. We are planning to upload the retrained model for testing.

DeokyunKim commented 4 years ago

Dear shuangliumax, thank you for your feedback. We also found missing code (apex.parallel.convert_syncbn_model).

All of the experiments in our paper were done in a single-GPU environment. However, we added distributed training code for researchers with different environments. We are retraining the face SR network with synchronized batch normalization, and we will then upload the retrained model weights.
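For anyone following along, apex.parallel.convert_syncbn_model is typically applied before wrapping the model for distributed training. A sketch assuming apex is installed and torch.distributed has already been initialized (the placeholder network below just stands in for the generator):

    import torch.nn as nn
    from apex.parallel import DistributedDataParallel, convert_syncbn_model

    # Placeholder network standing in for the SR generator
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

    # Convert plain BatchNorm layers to synchronized BatchNorm, then wrap for multi-GPU training
    model = convert_syncbn_model(model).cuda()
    model = DistributedDataParallel(model)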

Once again, thank you for your feedback. If you have more questions, please reply to me here.

DeokyunKim commented 4 years ago

> Hello, I trained this network and ran the test. At step 3, the best PSNR I got was only 10.48. I used the same number of training iterations as you (50K, 50K, 100K), and the learning rate decreased by an order of magnitude. Why is the PSNR so low?

Did you train the network with apex parallel? The test code should also be modified accordingly (model loading, dataloader, etc.). I uploaded the apex version temporarily; please refer to it.
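One common adjustment when evaluating a model trained with a distributed wrapper: the saved state dict keys may carry a module. prefix that a plain (non-wrapped) model will not accept. A hedged sketch of stripping it at load time (the checkpoint path and key layout here are assumptions, not necessarily how this repository saves its checkpoints):

    import torch

    # Load on CPU first; the checkpoint may be the state dict itself or wrap it under a key
    ckpt = torch.load('checkpoints/generator_checkpoint.ckpt', map_location='cpu')
    state_dict = ckpt.get('state_dict', ckpt)  # 'state_dict' key is an assumption

    # Strip the 'module.' prefix added by (Distributed)DataParallel wrappers, if present
    state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
                  for k, v in state_dict.items()}
    # generator.load_state_dict(state_dict)  # generator: the model built by the repository's code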

shuangliumax commented 4 years ago

@DeokyunKim Hello, I retrained with the code you provided again, and the PSNR was still only about 10.41 when I tested step 2 and step 3 with parallel training. Why is that? I don't know what went wrong. I'm confused.

DeokyunKim commented 4 years ago

What kind of dataset did you use?

shuangliumax commented 4 years ago

@DeokyunKim CelebA (img_align), the same dataset you used. Besides, I haven't made any changes except the batch size, from 16 to 32.

DeokyunKim commented 4 years ago

Could you test the uploaded model (generator_checkpoint_singleGPU.ckpt) as follows?

    python eval.py --data-path your/datapath/ --checkpoint-path ./checkpoints/generator_checkpoint_singleGPU.ckpt

shuangliumax commented 4 years ago

@DeokyunKim Yes, I have tested the model you provided; the PSNR is 22.3944. So I am even more puzzled why my trained model's test results were so low.

DeokyunKim commented 4 years ago

Did you use the dataloader I provided? You should normalize all of the images using torchvision.transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]). And please show me your training code.
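For completeness, that normalization is the standard torchvision transform mapping image tensors from [0, 1] to [-1, 1]; a sketch of a typical input pipeline (the resize size is illustrative):

    from torchvision import transforms

    transform = transforms.Compose([
        transforms.Resize((128, 128)),         # illustrative target resolution
        transforms.ToTensor(),                 # PIL image -> float tensor in [0, 1]
        transforms.Normalize([0.5, 0.5, 0.5],  # per-channel mean
                             [0.5, 0.5, 0.5])  # per-channel std, mapping [0, 1] -> [-1, 1]
    ])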

DeokyunKim commented 4 years ago

Are there any changes to the code? Please show me the changes in any part of the code.

DeokyunKim commented 4 years ago

Your dataset seems to be wrong. There is no face region in the images.

shuangliumax commented 4 years ago

If the problem is the dataset, then why is there no error when I run the single-GPU model you provided?

DeokyunKim commented 4 years ago

The distributed-trained model relies on the apex code published by NVIDIA. If you want to run the apex distributed-trained model (generator_checkpoint.ckpt) I uploaded, you should use:

    python -m torch.distributed.launch --nproc_per_node=4 eval.py \
        --distributed \
        --data-path './dataset' \
        --checkpoint-path 'checkpoints/generator_checkpoint.ckpt'