splinter21 opened 3 years ago
I have to find where I put the script that originally generated the dataset, but dataset generation is very slow compared to the speed of training. It is usually preferable to pre-generate or pre-augment the dataset when using large batch sizes, so that data generation does not bottleneck training. I figured that 2000 images should be enough, especially when used with augmentation (1 original + 7 flips/rotations, plus shearing, scaling, etc.). You can get upwards of 100,000 images with augmentation.
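The 8 flip/rotation variants are just the dihedral group of the square. A minimal sketch, assuming NumPy images (the function name is illustrative, not from the actual generation script):

```python
import numpy as np

def dihedral_augment(img):
    """Yield the 8 dihedral variants of an image:
    the original, its 3 rotations, and a mirror of each."""
    for k in range(4):
        rotated = np.rot90(img, k)  # 0, 90, 180, 270 degree rotations
        yield rotated
        yield np.fliplr(rotated)    # horizontal flip of each rotation

# 2000 base images x 8 variants = 16000 images; adding shearing and
# scaling variants on top is what pushes the count past 100000.
```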
I noticed that you trained the denoising model with 4 datasets, but trained the upscaler with only SYNLA-Dataset. Why?
The downscaling operation mainly affects salient features (i.e. edges and textures), which the network must learn to reconstruct. The SYNLA dataset is optimized for edges, lines and textures, and is thus appropriate for super-resolution. However, compression noise in images/videos is highly non-linear and does not follow the hypothesis stated above; in fact, most compression methods try to preserve salient features and discard the less visible ones. The goals of denoising and upscaling are therefore different, and using only SYNLA for denoising would add a large amount of training bias to the network.
What scaling method do you use (for augmentation)? I'm worried that improper scaling algorithms (e.g. bicubic upsampling) will also affect performance.
Integer scaling with average pooling should be the perfect operation, since it preserves the distribution of the images. With 256x256 images, you can downscale them to 128x128; cropping the four quadrants of each 256x256 image then gives you four more 128x128 images, for five in total. Shearing should also be done in integer ratios.
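A minimal sketch of this, assuming NumPy uint8 images (helper names are illustrative):

```python
import numpy as np

def avgpool2x(img):
    """Exact 2x downscale via average pooling (mean of each 2x2 block),
    which preserves the image distribution."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    # works for both grayscale (H, W) and color (H, W, C) arrays
    return img[:2 * h, :2 * w].reshape(h, 2, w, 2, *img.shape[2:]).mean(axis=(1, 3))

def five_patches(img):
    """Turn one 256x256 image into five 128x128 images:
    the four quadrant crops plus the average-pooled downscale."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:], avgpool2x(img)]
```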
If you wish to use bilinear resampling (for non-integer ratios between 0.5 and 2), simply downscale by x2 with average pooling at the end to preserve the distribution of the images.
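Concretely, a hedged sketch of that trick (it reuses `avgpool2x` from above; Pillow's bilinear resize stands in for whatever resampler you prefer):

```python
import numpy as np
from PIL import Image

def safe_rescale(img, ratio):
    """Rescale by a non-integer ratio in (0.5, 2): bilinear-resample to
    twice the target size, then average-pool down by 2x at the end."""
    h, w = img.shape[0], img.shape[1]
    target = (2 * round(w * ratio), 2 * round(h * ratio))  # PIL size is (W, H)
    big = np.asarray(Image.fromarray(img).resize(target, Image.BILINEAR))
    return avgpool2x(big)  # final 2x average pooling preserves the distribution
```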
Non-linear resampling methods are to be avoided unless you want to specially train the network to remove these specific non-linear artifacts! Natural low-resolution images do not exhibit non-linear artifacts such as ringing or overshoot.
I would also add that super-resolution and denoising are polar opposites: super-resolution tries to enhance details, while denoising tries to remove noise and thus might remove details. I would even argue that doing super-resolution and denoising at the same time is extremely difficult, even impossible in the general case (given an arbitrary downscaling operation and noise distribution).
I got it. If the goal is to upscale a natural video to a larger resolution, average pooling is OK: 0.5x (avg) -> 1x for training vs. 1x -> 2x (no ground truth) at test time. But if the goal is to reconstruct a degraded video, e.g. the raw is 1x but we can only get the 0.5x version while we want to watch the 1x one, we should use various downsampling methods to simulate the 0.5x degraded video for training. And Anime4K is the former.
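For the latter (restoration) case, a minimal sketch of such degradation simulation, assuming Pillow (the kernel pool is just an example, not any project's actual pipeline):

```python
import random
import numpy as np
from PIL import Image

# An illustrative pool of unknown downscaling kernels; a real restoration
# pipeline might also add compression noise after downscaling.
KERNELS = [Image.NEAREST, Image.BILINEAR, Image.BICUBIC, Image.LANCZOS, Image.BOX]

def simulate_degraded(img):
    """Produce a 0.5x training input from a 1x ground-truth frame
    using a randomly chosen resampling kernel."""
    h, w = img.shape[0], img.shape[1]
    kernel = random.choice(KERNELS)
    return np.asarray(Image.fromarray(img).resize((w // 2, h // 2), kernel))
```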
Exactly. You will notice that waifu2x works very well on degraded video, but does almost nothing to a lot of 1080p anime (try it yourself and compare the results). Anime4K includes more than just super-resolution, and is specially tailored to preserve edge sharpness when upsampling 1080p anime to 4K.
What test set (benchmark) do you use to compare waifu2x, Anime4K, etc.? I'm going to train a model and study anime video upscaling.
I think that when training, it would be better to dynamically generate various line-style images instead of using the pre-generated dataset of 2000+ images.
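For example, a minimal sketch of on-the-fly generation, assuming PyTorch (the line-drawing procedure here is purely illustrative, not SYNLA's actual generator):

```python
import numpy as np
import torch
from torch.utils.data import IterableDataset

class RandomLineArt(IterableDataset):
    """Synthesize a fresh random line-style image per step instead of
    reading from a fixed pre-generated dataset."""

    def __init__(self, size=256):
        self.size = size

    def __iter__(self):
        rng = np.random.default_rng()
        while True:
            img = np.full((self.size, self.size), 255, dtype=np.uint8)  # white canvas
            for _ in range(int(rng.integers(5, 20))):  # random number of strokes
                x0, y0, x1, y1 = rng.integers(0, self.size, size=4)
                # rasterize a straight stroke by sampling points along the segment
                n = int(max(abs(int(x1) - int(x0)), abs(int(y1) - int(y0)))) + 1
                xs = np.linspace(x0, x1, n).round().astype(int)
                ys = np.linspace(y0, y1, n).round().astype(int)
                img[ys, xs] = 0  # black line
            yield torch.from_numpy(img).float() / 255.0
```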