@LoSealL Are you still maintaining the project? If so, I really need your help.
@iPrayerr Sorry, I didn't notice this earlier. I've been too busy recently. I will check soon.
@iPrayerr Hi, the `Root` in datasets.yaml means the root folder of your image/video data. In your case, the root must be `/`. Or you can set `root=/dataset_folder` and `CUSTOM-TRAINHR[video]: HR`. Besides, there is a typo in your names: `CUSTOM_TRAINHR` vs. `CUSTOM-TRAINHR`.
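To illustrate (just a rough sketch on my side, not the actual loader code), `Root` and each `Path` entry simply combine into one folder, and in video mode every sub-folder under it is treated as one clip:

```python
# Rough illustration only (not the repository's loader); paths follow the example above.
from pathlib import Path

root = Path("/")                     # Root in Data/datasets.yaml
hr_entry = "dataset_folder/HR"       # CUSTOM-TRAINHR[video]

# In video mode, each sub-folder under the entry is treated as one video clip.
for clip in sorted((root / hr_entry).iterdir()):
    if clip.is_dir():
        frames = sorted(clip.glob("*.png"))
        print(clip.name, len(frames), "frames")
```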
It usually occurs after one epoch, but sometimes the epoch number can also be 2 or 3. Besides, my training set has 240 videos in total, i.e. 240*100 = 24000 frames. However, it only reads 200 batches when my batch_size is set to 4. I just wonder how to train with multiple videos as mentioned above, because when I just use one video folder containing 100 frames, everything is OK.
If I correct the data path, the training procedure is OK in my testing environment. If you meet any errors, please feel free to paste full logs here to help me debug.
In VSR training, for each epoch the total number of batches is fixed by --steps, which is 200 by default, no matter how many pictures are in your dataset, because it randomly crops patches from them.
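For example, as a rough back-of-envelope (not an exact epoch, since patches are cropped at random positions), covering all of your frames once would need a much larger --steps value:

```python
# Rough arithmetic only, using the numbers from your post; the loader samples
# random patches, so this is not an exact "one pass over the data".
total_frames = 240 * 100          # 240 clips x 100 frames each
batch_size = 4
steps_per_pass = total_frames // batch_size
print(steps_per_pass)             # 6000, versus the default --steps of 200
```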
OK, thanks a lot, I'll try. I'll paste more logs if new problems are raised.
@LoSealL Hi, I've followed your instructions, but it still didn't work.
Here are more details:
My environment:
Hardware: 32 GB RAM, GTX 1080 Ti (12 GB)
Software: Ubuntu 16.04, Python 3.6.5, tensorflow-gpu 1.10.0, tensorboardX 2.0, PyTorch 1.1.0, Protobuf 3.6.0
My dataset folders (actually it's REDS):
train_hr: /data/ruan/REDS/train/train_sharp/*/*.png
train_lr: /data/ruan/REDS/train/train_sharp_bicubic/*/*.png
val_hr: /data/ruan/REDS/val/val_sharp/*/*.png
val_lr: /data/ruan/REDS/val/val_sharp_bicubic/*/*.png
Here the first "*" stands for the video number (from 000 to a max number), and the second stands for a specific single frame (from 00000000 to a max number).
Settings related to my dataset in Data/datasets.yaml:
Root: /data/ruan/REDS
Path:
  REDSTRAIN-HR[video]: train/train_sharp
  REDSTRAIN-LR[video]: train/train_sharp_bicubic
  REDSVAL-HR[video]: val/val_sharp
  REDSVAL-LR[video]: val/val_sharp_bicubic
Dataset:
  REDS[video]:
    train:
      hr: REDSTRAIN-HR
      lr: REDSTRAIN-LR
    val:
      hr: REDSVAL-HR
      lr: REDSVAL-LR
And my training parameters:
batch_shape: [2, 3, 1, 32, 32]
lr: 1.0e-4
lr_decay:
  method: multistep
  decay_step: [250, 500, 750, 1000, 1250]
  decay_rate: 0.5
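As a quick sketch of what the multistep schedule above should produce (my own illustration; whether decay_step counts epochs or global steps depends on the trainer):

```python
# Illustration of the multistep decay in the config above (not the trainer's code).
base_lr = 1.0e-4
decay_step = [250, 500, 750, 1000, 1250]
decay_rate = 0.5

def lr_at(t):
    # lr is halved each time t passes one of the listed decay steps
    passed = sum(1 for s in decay_step if t >= s)
    return base_lr * decay_rate ** passed

for t in (1, 250, 600, 1300):
    print(t, lr_at(t))   # 1e-04, 5e-05, 2.5e-05, 3.125e-06
```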
(1) Set no memory_limit:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50
2020-04-15 04:07:04,621 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:07:05,914 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguangWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:07:14,027 INFO: Total params: 1639676
2020-04-15 04:07:14,028 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-15 04:07:14,028 INFO: Fitting: [SOF]
| 2020-04-15 04:09:25 | Epoch: 1/50 | LR: 0.0001 |
30%|####################6 | 59/200 [04:33<04:56, 2.10s/batch, image=00.09236, flow/lvl1=00.02955, flow/lvl2=00.02333, flow/lvl3=00.08839]
Killed
(2) Set memory_limit:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50 --memory_limit 2GB
2020-04-15 04:29:45,743 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:30:29,301 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguanWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:31:20,773 INFO: Total params: 1639676
2020-04-15 04:31:20,775 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-15 04:31:20,775 INFO: Fitting: [SOF]
| 2020-04-15 04:31:51 | Epoch: 1/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [05:31<00:00, 1.45s/batch, image=00.07591, flow/lvl1=00.06990, flow/lvl2=00.08228, flow/lvl3=00.10484]
| Epoch average image = 0.106827 |
| Epoch average flow/lvl1 = 0.048361 |
| Epoch average flow/lvl2 = 0.063642 |
| Epoch average flow/lvl3 = 0.105035 |
Test: 100%|################################################################################################################################################| 10/10 [00:39<00:00, 1.91s/it]
psnr: 12.935657,
| 2020-04-15 04:38:32 | Epoch: 2/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [05:10<00:00, 1.43s/batch, image=00.04348, flow/lvl1=00.05901, flow/lvl2=00.06701, flow/lvl3=00.07531]
| Epoch average image = 0.034973 |
| Epoch average flow/lvl1 = 0.040674 |
| Epoch average flow/lvl2 = 0.045621 |
| Epoch average flow/lvl3 = 0.059479 |
Traceback (most recent call last):
  File "train.py", line 99, in <module>
In (2), whatever value --memory_limit is set to, the traceback is still the same.
Besides, when I reload the saved parameters to continue training, it runs for at most one more epoch:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py sofvsr --dataset reds --cuda --epochs 50 --memory_limit 6GB
2020-04-15 04:45:22,443 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-15 04:45:23,823 INFO: LICENSE: SOF-VSR is implemented by Longguan Wang. @LongguanWang https://github.com/LongguangWang/SOF-VSR.
2020-04-15 04:45:32,135 INFO: Total params: 1639676
2020-04-15 04:45:32,136 INFO: Restoring params for sof from /data/zp/Graduation/VideoSuperResolution-master/Results/sofvsr/save/sofep0001.pth.
2020-04-15 04:45:32,375 INFO: Fitting: [SOF]
| 2020-04-15 04:46:51 | Epoch: 2/50 | LR: 0.0001 |
100%|#####################################################################| 200/200 [06:21<00:00, 1.48s/batch, image=00.05740, flow/lvl1=00.06096, flow/lvl2=00.07924, flow/lvl3=00.09723]
| Epoch average image = 0.081686 |
| Epoch average flow/lvl1 = 0.055793 |
| Epoch average flow/lvl2 = 0.086770 |
| Epoch average flow/lvl3 = 0.147451 |
Test: 100%|################################################################################################################################################| 10/10 [00:39<00:00, 2.19s/it]
psnr: 12.557086,
Traceback (most recent call last):
  File "train.py", line 99, in <module>
Hope that the above will be helpful for your debugging.
@iPrayerr I don't get how you arranged your training data.
My dataset folders (actually it's REDS):
train_hr: /data/ruan/REDS/train/train_sharp/*/*.png
train_lr: /data/ruan/REDS/train/train_sharp_bicubic/*/*.png
val_hr: /data/ruan/REDS/val/val_sharp/*/*.png
val_lr: /data/ruan/REDS/val/val_sharp_bicubic/*/*.png
Here the first "*" stands for the video number (from 000 to a max number), and the second stands for a specific single frame (from 00000000 to a max number).
Do they look like this:
train_sharp/001.png train_sharp/002.png
train_sharp_bicubic/001.png train_sharp_bicubic/002.png
If so, then in video mode all images in the same folder will be treated as one video clip. To give the dataloader the right video data, I'd arrange the data like this:
train_sharp/v01/001.png train_sharp/v01/002.png
train_sharp/v02/001.png train_sharp/v02/002.png
...
train_sharp_bicubic/v01/001.png train_sharp_bicubic/v01/002.png
train_sharp_bicubic/v02/001.png train_sharp_bicubic/v02/002.png
...
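To double-check that HR and LR clips pair up, you can run a quick sanity script like this (just a sketch from me; adjust the two paths to your REDS folders):

```python
# Quick sanity check (not part of VSR): every HR clip folder should have a matching
# LR clip folder with the same number of frames. Paths assume the REDS layout above.
from pathlib import Path

hr_root = Path("/data/ruan/REDS/train/train_sharp")
lr_root = Path("/data/ruan/REDS/train/train_sharp_bicubic")

for hr_clip in sorted(p for p in hr_root.iterdir() if p.is_dir()):
    lr_clip = lr_root / hr_clip.name
    n_hr = len(list(hr_clip.glob("*.png")))
    n_lr = len(list(lr_clip.glob("*.png"))) if lr_clip.is_dir() else 0
    if n_hr != n_lr:
        print(f"{hr_clip.name}: {n_hr} HR frames vs {n_lr} LR frames")
```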
I don't even remember adding SSIM to the validation! SSIM is too heavy, and it will significantly slow down validation, so I usually calculate it offline.
BTW, using skimage.measure.compare_ssim is very useful to get SSIM and other metrics.
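For example, something along these lines (works on older scikit-image; newer releases renamed these to skimage.metrics.peak_signal_noise_ratio / structural_similarity):

```python
# Offline PSNR/SSIM for one SR/GT pair; the file names here are just placeholders.
import numpy as np
from PIL import Image
from skimage.measure import compare_psnr, compare_ssim  # scikit-image <= 0.17

sr = np.asarray(Image.open("sr_frame.png").convert("RGB"))
gt = np.asarray(Image.open("gt_frame.png").convert("RGB"))

print("psnr:", compare_psnr(gt, sr, data_range=255))
print("ssim:", compare_ssim(gt, sr, data_range=255, multichannel=True))
```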
train_sharp/v01/001.png train_sharp/v01/002.png
train_sharp/v02/001.png train_sharp/v02/002.png
...
train_sharp_bicubic/v01/001.png train_sharp_bicubic/v01/002.png
train_sharp_bicubic/v02/001.png train_sharp_bicubic/v02/002.png
...
@LoSealL Sorry, I forgot a "*". Previously I meant `/data/ruan/REDS/train/train_sharp/*/*.png`, not `/data/ruan/REDS/train/train_sharp/*.png`.
Here the "*" stands for a specific number(video_num or frame_num), for example,
/data/ruan/REDS/train/train_sharp/007/00000000.png
or
/data/ruan/REDS/val/val_sharp_bicubic/013/00000099.png
The data is arranged just like yours, and it still raises the problem I mentioned before.
@iPrayerr You're right, this is a bug when enabling memory_limit. Sorry for that; I made a patch to fix it.
Thanks a lot. I've just tested it; it seems everything's OK now.
@LoSealL Hi, sorry to bother you again. Now that the above problem is solved, I found that both PSNR and SSIM are quite low in the testing process of many algorithms after training (SSIM is added by myself and doesn't affect the training process). I've tried 6 algorithms and all of them have the same problem.
Here're two examples:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py vespcn --dataset reds --epochs 35 --steps 8000 --val_steps 300 --cuda --memory_limit 1.5GB
2020-04-17 06:03:20,380 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-17 06:03:32,087 INFO: LICENSE: VESPCN is proposed at CVPR2017 by Twitter. Implemented by myself @LoSealL.
2020-04-17 06:03:55,234 INFO: Total params: 878787
2020-04-17 06:03:55,235 WARNING: trying to restore state for optimizer opt, but failed.
2020-04-17 06:03:55,235 INFO: Fitting: [VESPCN]
| 2020-04-17 06:04:20 | Epoch: 1/35 | LR: 0.0001 |
100%|#################################################################################################| 8000/8000 [1:56:12<00:00, 1.19batch/s, image=00.02813, flow=00.05529, tv=00.09024]
| Epoch average image = 0.056315 |
| Epoch average flow = 0.040478 |
| Epoch average tv = 0.330291 |
Test: 100%|##############################################################################################################################################| 300/300 [05:33<00:00, 1.08it/s]
psnr: 12.205831, ssim: 0.483721,
| 2020-04-17 08:06:16 | Epoch: 2/35 | LR: 0.0001 |
100%|#################################################################################################| 8000/8000 [1:49:09<00:00, 1.25batch/s, image=00.09327, flow=00.03729, tv=00.04959]
| Epoch average image = 0.052351 |
| Epoch average flow = 0.035497 |
| Epoch average tv = 0.070800 |
Test: 100%|##############################################################################################################################################| 300/300 [05:32<00:00, 1.00s/it]
psnr: 11.771148, ssim: 0.445891,
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ CUDA_VISIBLE_DEVICES=1 python train.py rbpn --dataset reds --epochs 35 --steps 8000 --val_steps 300 --cuda --memory_limit 2GB
2020-04-17 07:44:38,519 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
2020-04-17 07:44:40,218 INFO: LICENSE: RBPN is implemented by M. Haris, et. al. @alterzero
2020-04-17 07:44:40,218 WARNING: I use unsupervised flownet to estimate optical flow, rather than pyflow module.
2020-04-17 07:44:47,539 INFO: Total params: 14510537
2020-04-17 07:44:47,540 WARNING: trying to restore state for optimizer adam, but failed.
2020-04-17 07:44:47,540 INFO: Fitting: [RBPN]
| 2020-04-17 07:45:05 | Epoch: 1/35 | LR: 0.0001 |
100%|##############################################################################################| 8000/8000 [1:16:33<00:00, 1.89batch/s, flow=00.02468, image=00.19080, total=00.21548]
| Epoch average flow = 0.131890 |
| Epoch average image = 0.258228 |
| Epoch average total = 0.390117 |
Test: 100%|##############################################################################################################################################| 300/300 [02:50<00:00, 2.13it/s]
psnr: 12.385075, ssim: 0.432265,
| 2020-04-17 09:04:39 | Epoch: 2/35 | LR: 0.0001 |
100%|##############################################################################################| 8000/8000 [1:19:01<00:00, 1.94batch/s, flow=00.17723, image=00.29885, total=00.47607]
| Epoch average flow = 0.159079 |
| Epoch average image = 0.214459 |
| Epoch average total = 0.373538 |
Test: 100%|##############################################################################################################################################| 300/300 [02:49<00:00, 2.09it/s]
psnr: 12.186343, ssim: 0.446204,
I just show 2 epochs as an example; more epochs lead to the same result as well (SRCNN has been trained for over 37 epochs, i.e. 296000 iterations, and still shows the same problem).
However, for some algorithms like SOF-VSR, when I used a single video as training data and set a value for --steps, the PSNR could reach a normal level of around 24 or 26, while for others like SRCNN the problem remains.
I guess there may be something wrong with several hyper-parameters. Could you offer me some suggestions?
Thx. :)
Usually 2 ways to debug:
I didn't train SOF-VSR from scratch; you can check the paper and my implementation carefully. I just fine-tuned SOF-VSR on top of the official pre-trained weights, and the result is as I expected.
OK, I'll try.
@LoSealL Hello, I've seen your latest commit, "Fix dataloader mess up the file order".
I've tested it on the previous dataset, treating it as an image dataset to test CARN. However, nothing has been fixed at all.
Specifically, in the test process I separately saved the LR, GT, and CARN's SR results. Still, some of them are matched while most aren't, which is the same as with the previous version.
However, when I ran check_dataset.py, it didn't find any unmatched pairs:
zp@HP-Z840:/data/zp/Graduation/VideoSuperResolution-master/Train$ python check_dataset.py redimg
2020-04-23 00:51:37,680 WARNING: [!] PyTorch version too low: 1.1.0, recommended 1.2.0
Dataset: REDIMG
========= CHECKING train =========
Found train
set in "REDIMG":
Found 24000 ground-truth train data
Found 24000 custom degraded train data
========= CHECKING val =========
Found val
set in "REDIMG":
Found 3000 ground-truth val data
Found 3000 custom degraded val data
========= CHECKING test =========
REDIMG doesn't contain any test data.
Do you know what the problem is?
@iPrayerr Affirmative, it's a bug. To work around it, use --threads=1. It will take me some time to fix this :(
Hello, I've tried to train with my own dataset, whose folder structure is as below:
/dataset_folder/HR/video_num/*.png
/dataset_folder/LR/X4/video_num/*.png
And I've organized them following the instructions in Data/datasets.yaml:
Root: /home/user
Path:
  CUSTOM-TRAINHR[video]: dataset_folder/HR
  CUSTOM-TRAINLR[video]: dataset_folder/LR/X4
Dataset:
  CUSTOM[video]:
    train:
      hr: CUSTOM_TRAINHR
      lr: CUSTOM_TRAINLR
My valset is organized in the same format as above. However, a Traceback was thrown when I tried the command:
python train.py sofvsr --dataset custom --epochs 100 --cuda
which is:
Traceback (most recent call last):
  File "train.py", line 99, in <module>
    main()
  File "train.py", line 93, in main
    t.fit([lt, lv], config)
  File "/home/zp/VideoSuperResolution-master/VSR/Backend/Torch/Framework/Trainer.py", line 110, in fit
    memory_limit=mem)
  File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 322, in make_one_shot_iterator
    raise fs.exception()
  File "/home/user/.conda/envs/zp/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/Loader.py", line 393, in _prefecth_chunk
    self.cache['hr'].append(img.read_frame(img.frames))
  File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in read_frame
    image_bytes = [BytesIO(self.read()) for _ in range(frames)]
  File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 362, in <listcomp>
    image_bytes = [BytesIO(self.read()) for _ in range(frames)]
  File "/home/zp/VideoSuperResolution-master/VSR/DataLoader/VirtualFile.py", line 129, in read
    raise EOFError(f'End of File! {self.name}')
EOFError: End of File! 068
It usually occurs after one epoch, but sometimes the epoch number can also be 2 or 3. Besides, my training set has 240 videos in total, i.e. 240*100 = 24000 frames. However, it only reads 200 batches when my batch_size is set to 4.
I just wonder how to train with multiple videos as mentioned above, because when I just use one video folder containing 100 frames, everything is OK.
Thx.