Closed: jundongAI closed this issue 4 years ago
MultiSaver relies on Python multiprocessing to parallelize the image-saving process. It might be the Windows OS that is causing this error (I'm not sure). Python multiprocessing handles child processes differently depending on whether the OS is Windows or Unix. I have never had such errors when testing on Ubuntu LTS systems (14.04 ~ 18.04).
To isolate the issue, could you try running the code on a Linux machine? I don't have a Windows GPU machine to test whether the OS is the cause.
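For context, the difference is in the default process start method: Unix systems use "fork", where the child inherits the parent's memory, while Windows uses "spawn", which has to pickle the Process target, and locally defined (nested) functions are not picklable. Below is a small standalone sketch of that difference; it is only an illustration and is not the repository's code.

```python
# Illustration of the fork-vs-spawn difference mentioned above
# (standalone sketch, not DeepDeblur-PyTorch code).
import multiprocessing as mp

def run_with(method):
    def local_worker():          # nested function: only works under "fork"
        print(f"{method}: child ran")
    ctx = mp.get_context(method)
    p = ctx.Process(target=local_worker)
    try:
        p.start()
        p.join()
    except Exception as exc:
        print(f"{method}: failed with {type(exc).__name__}: {exc}")

if __name__ == "__main__":
    print("default start method:", mp.get_start_method())
    for method in ("fork", "spawn"):
        if method in mp.get_all_start_methods():
            run_with(method)
```

On Linux the fork case prints from the child, while the spawn case fails with the same kind of "Can't pickle local object" error reported in this issue; on Windows only spawn is available.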
Hi Nah! I took your advice, tried running the code on a Linux machine (Ubuntu 18.04), and used the code you updated on GitHub five days ago. But when I tested the model on Ubuntu 18.04, the message on the screen was as follows:
jundong@jundong-X550VQ:~/DeepDeblur-PyTorch-master2/src$ python main.py
===> Loading test dataset: GOPRO_Large
Loading model from ../experiment/2020-07-09_13-41-50/models/model-2.pt
Loading optimizer from ../experiment/2020-07-09_13-41-50/optim/optim-2.pt
Loss function: 1*L1+1*ADV
Metrics: PSNR,SSIM
Loading loss record from ../experiment/2020-07-09_13-41-50/loss.pt
===> Initializing trainer
results are saved in ../experiment/2020-07-09_13-41-50/result
Loading model from ../experiment/2020-07-09_13-41-50/models/model-2.pt
Loading optimizer from ../experiment/2020-07-09_13-41-50/optim/optim-2.pt
Loading loss record from ../experiment/2020-07-09_13-41-50/loss.pt
|          | 0/100 [00:01<?, ?it/s]
Then it stopped, without any further information.
When I looked at the folder "../experiment/2020-07-09_13-41-50/result", I found it was empty. I also ran the updated code on a Windows machine, and the situation was the same as on Ubuntu. I am very confused about this problem. Attached is my experiment file; could you help me deal with it?
Hi @jundongAI,

Did you change anything from the code and the dataset? I assume you did. Your error log shows unexpected code behaviors. If you run the code as is (as shown by python main.py in your reply):

It should load the train dataset before the test dataset.
The loss function should be L1.
The trainer should be initialized without loading a model and an optimizer, as the save_dir is not specified in the input command.
The progress bar should show a maximum iteration number of 262 for batch size 8 and 131 for batch size 4, not 100.

Below is an example terminal log that should be shown by running the code.

python main.py --batch_size 8
===> Loading train dataset: GOPRO_Large
===> Loading test dataset: GOPRO_Large
Loss function: 1*L1
Metrics: PSNR,SSIM
===> Initializing trainer
results are saved in ../experiment/2020-07-14_00-52-58/result
[Epoch 1 / lr 1.00e-04]
Train Loss: 53.2: |████▌      | 43/262 [01:10<05:58, 1.64s/it]

If you want me to investigate your code, do not reply directly to the GitHub notification email; instead, send an email to me directly. Also, please describe the modifications if you made any.
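As a rough sketch of the save_dir behavior implied above (the argument handling below is an assumption for illustration, not the repository's actual options.py): when --save_dir is omitted, a fresh timestamped experiment directory is used and no checkpoint is loaded, whereas an existing --save_dir would be resumed from.

```python
# Hypothetical sketch; argument names and defaults are assumptions,
# not DeepDeblur-PyTorch's actual options code.
import argparse
import os
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument('--save_dir', type=str, default='',
                    help='experiment directory name (assumed default: empty)')
args = parser.parse_args()

if not args.save_dir:
    # No --save_dir given: start a fresh run in a timestamped directory.
    args.save_dir = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')

experiment_root = os.path.join('..', 'experiment', args.save_dir)
resume = os.path.isdir(os.path.join(experiment_root, 'models'))
print('results are saved in', os.path.join(experiment_root, 'result'))
print('loading a saved model and optimizer:', resume)
```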
Hi Nah! Thank you very much for your unselfish help recently. I found the key information in your last email: "the trainer should be initialized without loading a model and an optimizer, as the save_dir is not specified in the input command." I now run the code with the command python main.py --n_GPUs 1 --batch_size 1 --num_workers 2 --save_dir GOPRO_L1_ADV, and it works normally without any errors. Because of the limited computing power of my computer, I did change parts of the code, such as the batch size and num_workers. Besides, I added a line of code to options.py about the save_dir, as shown in the following picture:
As my statement was not comprehensive enough, you may have spent extra effort dealing with my problem. I sincerely apologize for that. Finally, I would like to thank you again for your help!
Hi @jundongAI
If you want to train with adversarial loss, you should specify the loss function in the input command. Otherwise, the discriminator won't be declared.
Here are some sample commands that are shown in the usage examples.
# adversarial training
python main.py --n_GPUs 1 --batch_size 8 --loss 1*L1+1*ADV
python main.py --n_GPUs 1 --batch_size 8 --loss 1*L1+3*ADV
python main.py --n_GPUs 1 --batch_size 8 --loss 1*L1+0.1*ADV
You may want to change the commands as below:
python main.py --n_GPUs 1 --batch_size 1 --num_workers 2 --loss 1*L1+1*ADV --save_dir GOPRO_L1_ADV
python main.py --n_GPUs 1 --batch_size 1 --num_workers 2 --loss 1*L1+3*ADV --save_dir GOPRO_L1_ADV
python main.py --n_GPUs 1 --batch_size 1 --num_workers 2 --loss 1*L1+0.1*ADV --save_dir GOPRO_L1_ADV
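For reference, a loss specification such as 1*L1+1*ADV is a weighted sum of terms written as weight*NAME joined by +. A minimal sketch of how such a string could be split into (weight, name) pairs follows; parse_loss_spec is a hypothetical helper, not the repository's actual loss parser.

```python
# Minimal sketch (not the repository's loss module) of splitting a
# specification such as "1*L1+1*ADV" into weighted terms.
def parse_loss_spec(spec: str):
    """Return a list of (weight, loss_name) pairs from 'w1*NAME1+w2*NAME2'."""
    terms = []
    for term in spec.split('+'):
        weight, name = term.split('*')
        terms.append((float(weight), name))
    return terms

print(parse_loss_spec('1*L1+1*ADV'))    # [(1.0, 'L1'), (1.0, 'ADV')]
print(parse_loss_spec('1*L1+0.1*ADV'))  # [(1.0, 'L1'), (0.1, 'ADV')]
```

Each weight scales its term in the total training loss, which is why leaving ADV out of the string means no adversarial term (and no discriminator) at all.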
Hi Nah! When I tested the model, the following problem occurred.

C:\Anaconda\python.exe D:/bishe/DeepDeblur-PyTorch-master/src/main.py
===> Loading test dataset: GOPRO_Large
Loading model from ../experiment\2020-07-09_13-41-50\models\model-2.pt
Loading optimizer from ../experiment\2020-07-09_13-41-50\optim\optim-2.pt
Loss function: 1*L1+1*ADV
Metrics: PSNR,SSIM
Loading loss record from ../experiment\2020-07-09_13-41-50\loss.pt
===> Initializing trainer
results are saved in ../experiment\2020-07-09_13-41-50\result
Loading model from ../experiment\2020-07-09_13-41-50\models\model-2.pt
Loading optimizer from ../experiment\2020-07-09_13-41-50\optim\optim-2.pt
Loading loss record from ../experiment\2020-07-09_13-41-50\loss.pt
Test Loss: 34.6 PSNR: 23.42 SSIM: 0.6894: |          | 1/100 [00:48<1:20:08, 48.57s/it]
Can't pickle local object 'MultiSaver.begin_background.<locals>.t'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Anaconda\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Anaconda\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Test Loss: 34.4 PSNR: 23.64 SSIM: 0.7096: | | 1/100 [01:04<1:46:44, 64.69s/it]
Traceback (most recent call last):
  File "D:/bishe/DeepDeblur-PyTorch-master/src/main.py", line 69, in <module>
    main()
  File "D:/bishe/DeepDeblur-PyTorch-master/src/main.py", line 66, in main
    main_worker(args.rank, args)
  File "D:/bishe/DeepDeblur-PyTorch-master/src/main.py", line 61, in main_worker
    trainer.imsaver.join_background()
  File "D:\bishe\DeepDeblur-PyTorch-master\src\utils.py", line 131, in join_background
    p.join()
  File "C:\Anaconda\lib\multiprocessing\process.py", line 123, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
Could you tell me how to deal with it?
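For what it's worth, the two tracebacks above are consistent with a single root cause: under Windows' spawn start method, Process.start() must pickle the locally defined saving worker ('MultiSaver.begin_background.<locals>.t') and fails, the child then reads an empty pipe (EOFError: Ran out of input), and because the process was never started, the later join_background() call trips the "can only join a started process" assertion. A minimal standalone reproduction of that chain is below; it is an illustrative script, not the repository's code.

```python
# Illustrative reproduction (not repository code) of the failure chain above:
# a locally defined target cannot be pickled under the "spawn" start method,
# so start() fails, and joining the never-started process then asserts.
import multiprocessing as mp

def make_local_target():
    def t():                      # nested function, like the saver worker
        pass
    return t

if __name__ == "__main__":
    ctx = mp.get_context("spawn")     # Windows' default start method
    p = ctx.Process(target=make_local_target())

    try:
        p.start()                     # AttributeError: Can't pickle local object ...
    except Exception as exc:
        print("start() failed:", exc)

    try:
        p.join()                      # AssertionError: can only join a started process
    except AssertionError as exc:
        print("join() failed:", exc)
```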