XPixelGroup / BasicSR

Open Source Image and Video Restoration Toolbox for Super-resolution, Denoising, Deblurring, etc. Currently, it includes EDSR, RCAN, SRResNet, SRGAN, ESRGAN, EDVR, BasicVSR, SwinIR, ECBSR, etc. It also supports StyleGAN2 and DFDNet.
https://basicsr.readthedocs.io/en/latest/
Apache License 2.0

Practical use of ESRGAN in video game upscaling #69

Open SirIzumi opened 5 years ago

SirIzumi commented 5 years ago

@xinntao Sorry for creating a new issue, but it is a completely different topic I want to address now.

I am not sure if you are aware of it or not, but during the last 2-3 months your ESRGAN / BasicSR projects have become HUGELY popular in video gaming communities. There are literally thousands of people, maybe hundreds of thousands, who are using ESRGAN to make old video game textures, sprites and other images look HD and new (and/or to enjoy the results), with varying degrees of success. There is a whole Reddit subforum dedicated to this cause, particularly this and this thread dedicated to running and training ESRGAN. And, by the way, there are many people there with zero to little knowledge of Python and machine learning, so it is not so easy for us to get our heads around it.

But since the start we encountered some problem, which I will quote:

The problem with most game sprite upscaling is that the AI system assumes that the images it gets are downscaled and it needs to recover detail that used to be there. Of course, on lots of game sprites this doesn't work since they were simply made at that size. The small hints of remainders of details that it looks for in downscaled photos simply don't exist there.

Actually, I can translate it as: "We don't have downscaled crops of images from which we need to recover information; we have complete, non-downscaled images that we need to upscale with the best quality possible."

So I want to ask you directly, as the author and the person who knows ESRGAN the best in the world: How can we achieve what we want? Maybe we need to tweak specific options, maybe there are some secrets for preparing training datasets, or something else?

Regarding this I had an idea: make our HR images as big as 512x512 or even bigger, taking the full-sized textures with no crop/tiling. That way we can use LR images of at least 128x128 with lots of pixel information, or even bigger in the case of 2x-3x models, but with a low batch_size, maybe set to 2 or even 1! Could this work for us, game modders? As far as I understand, we would need not only to set hr_size to 512, but also to set up a custom "which_model_D", since there are only the "discriminator_vgg_128" and 192 options in the code; I am not sure if it's possible to add more options. P.S. regarding the discriminator: there is a Discriminator_VGG_128_SN which uses Spectral Normalization, but no 192_SN for some reason. Is it usable, and what's the difference?

Or maybe you could point us in a direction to work on all that texture/sprite upscaling? Currently, among the hundreds of attempts to train a model, we have only a SINGLE ONE, based on Manga109, which is used for all the work people do. And no one has an idea how and why it works when the others do not!

And once again, thanks a lot from me and many many many other people for your work!

xinntao commented 5 years ago

Hi @SirIzumi , thank you very much for creating this "issue". It gives a lot of information about Super-Resolution (SR) in the video gaming community, and I am very pleased to see that some of our work can be applied to real-world applications. But as you said, there are still some differences between the research problem and the real-world problem (like upsampling game textures).

ESRGAN and the other models in BasicSR are actually trained for a known kernel (taking the high-resolution image and down-sampling it with a known kernel). So, if they are applied directly to images with different kernels, or even to images whose high-resolution counterpart is not available (as in the video gaming problem), the results may have noisy artifacts.

Usually, for this case, using a related dataset to fine-tune the ESRGAN models will help a bit. For example, you can collect images from games or similar images (like anime-style imagery), and then downsample them with known kernels (usually we use a bicubic kernel with anti-aliasing). Fine-tuning the model with these image pairs is helpful because: i) the statistics of natural images and gaming images are different; ii) ESRGAN is currently trained on natural images; iii) if ESRGAN can capture the statistics of gaming images, the results will improve. (I think this is why the Manga109 model could work while the original ESRGAN failed.)
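For illustration, here is a minimal Python sketch of that pairing step. It is not an official BasicSR script: it assumes Pillow is installed and that the HR images sit in a hypothetical datasets/game_hr folder, and the reference pipeline uses MATLAB's imresize with anti-aliasing, so the kernel is only approximated here.

```python
import os
from PIL import Image

hr_dir = "datasets/game_hr"          # hypothetical folder of collected HR textures
hr_mod_dir = "datasets/game_hr_mod"  # cropped HR copies, so pair sizes stay consistent
lr_dir = "datasets/game_lr_x4"       # output folder for the generated LR images
scale = 4                            # upscaling factor the model is trained for

os.makedirs(hr_mod_dir, exist_ok=True)
os.makedirs(lr_dir, exist_ok=True)
for name in sorted(os.listdir(hr_dir)):
    hr = Image.open(os.path.join(hr_dir, name)).convert("RGB")
    # Crop a few border pixels so the HR size is divisible by the scale factor.
    w, h = hr.size
    hr = hr.crop((0, 0, w - w % scale, h - h % scale))
    # Bicubic downsampling plays the role of the "known kernel" mentioned above.
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    hr.save(os.path.join(hr_mod_dir, name))
    lr.save(os.path.join(lr_dir, name))
```

The resulting HR/LR folders can then be pointed to from the dataset section of the training options file.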

Another problem may arise from noise. ESRGAN actually has no ability to denoise. But the fantastic tool Waifu2x also has denoising ability, which is very useful for anime-style imagery (though it may hurt textures). Maybe we can add a denoising component to ESRGAN. (BTW, you can adjust the model with network interpolation; see the ESRGAN repo.)
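The network interpolation mentioned here is described in the ESRGAN repo; the sketch below follows the same idea (blending a PSNR-oriented checkpoint with a GAN-trained one), with the checkpoint paths as placeholders to adjust for your setup.

```python
import torch
from collections import OrderedDict

alpha = 0.8  # 0.0 = pure PSNR model (smooth), 1.0 = pure GAN model (sharp, more artifacts)
net_psnr = torch.load("RRDB_PSNR_x4.pth")    # placeholder path to the PSNR-oriented weights
net_gan = torch.load("RRDB_ESRGAN_x4.pth")   # placeholder path to the GAN-trained weights

# Linearly interpolate every parameter of the two state dicts.
net_interp = OrderedDict()
for k, v_psnr in net_psnr.items():
    net_interp[k] = (1 - alpha) * v_psnr + alpha * net_gan[k]

torch.save(net_interp, "interp_{:02d}.pth".format(int(alpha * 10)))
```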

As for using large patches for training: from my experience, training with 128x128 patches for natural images seems good enough. Larger training patches may require more training time (especially for training D). Discriminator_VGG_128_SN does not improve the performance, so I do not use it.

I am very new to video gaming and to SR for video gaming, but I am very glad to see that our work can help. If you have any problem, do not hesitate to ask here and I will try my best to give my opinions. (I will try my best to reply in time, but I am a bit busy in February +_+||) I will look deeper into the SR problem in video gaming after I finish my project.

SirIzumi commented 5 years ago

Wow, thanks for the complete answer! If you really are interested in real-world apps (games) using your tool, that'd be awesome! BTW, ESRGAN combined with the modified Manga109 dataset by kingdomakrillic (double JPEG conversion for HR, triple JPEG for LR) provides consistently better results than ANY free or paid software, including Topaz AI Gigapixel, waifu2x, letsenhance.io and many others. Except maybe Nvidia GMT, which is in closed beta and unavailable for people to compare, except for big developers. Actually, just by googling "ESRGAN" you will find a lot of links, news and threads that speak for themselves about how effective it is.

Some more questions if you don't mind:

  1. Does your statement mean that any method of creating LR images other than bicubic downsampling will be useless, even when training from scratch? Some games have low-res textures alongside high-res "original" ones, and the low-res textures look like they were downsampled with nearest neighbor; does that mean they can't be used as LR with the high-res ones as HR?

  2. Is it even worth fine-tuning the default PSNR/ESRGAN models when the images/textures/art are so different from those 100 images you used for the initial training, instead of restarting completely from scratch?

  3. You are right, some images can become noisy, especially if they had compression artifacts. There is quite a lot of third-party software to deal with the noise, but if you are able to build a self-learning denoiser into the models, that would be just very cool.

  4. Maybe you could give us some insight into the different options? kingdomakrillic shared some ideas like "increasing pixel_weight gives images a mosaic look, tune it down", but there is a lot more: how to prepare the dataset, gan_weight and the l1 and l2 criteria, and general ideas on how to set lr_D and lr_G for different situations. (A config sketch showing where these knobs live follows after the P.S. below.)

P.S. Tried to train a model with _SN discriminator, it really did not help much.
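For orientation, a hedged sketch of where those knobs live, assuming the 2019-era BasicSR JSON options file (e.g. options/train/train_ESRGAN.json); key names and default values may differ between versions, so treat this as an excerpt of the "train" block rather than a recommendation.

```json
"train": {
  "lr_G": 1e-4,
  "lr_D": 1e-4,
  "lr_scheme": "MultiStepLR",
  "lr_steps": [50000, 100000, 200000, 300000],
  "lr_gamma": 0.5,
  "pixel_criterion": "l1",
  "pixel_weight": 1e-2,
  "feature_criterion": "l1",
  "feature_weight": 1,
  "gan_type": "vanilla",
  "gan_weight": 5e-3,
  "niter": 500000
}
```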

xinntao commented 5 years ago

Thanks for your information.

  1. No. What I mean is that the pre-trained ESRGAN model (provided in the ESRGAN GitHub repo) is trained for the bicubic downsampling kernel. If we want to upsample images with some other kernel, we'd better train/fine-tune another model specifically for that kernel to achieve better visual results. For gaming, if we could collect some LR and HR pairs for training/fine-tuning, it would improve the results.

  2. Fine-tuning is a useful technique in deep learning. Many other tasks use pre-trained models to improve performance, even when the models were trained for different tasks. Fine-tuning from the pretrained ESRGAN will shorten the training time (you can decrease the training iterations by modifying the learning scheme) and give better results than training from scratch. [You can use both the provided G and D for ESRGAN to have a try; a config sketch follows after this list.]

  3. Later, I think I can write some simple instructions.
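A hedged sketch of what that fine-tuning setup might look like in the JSON options file; the checkpoint paths are placeholders and the key names follow the 2019-era BasicSR layout, so adjust them to your version. The shortened niter and lr_steps reflect the "decrease the training iterations" advice above.

```json
"path": {
  "pretrain_model_G": "../experiments/pretrained_models/RRDB_ESRGAN_x4.pth",
  "pretrain_model_D": "../experiments/pretrained_models/RRDB_ESRGAN_x4_D.pth"
},
"train": {
  "niter": 100000,
  "lr_steps": [20000, 40000, 60000, 80000]
}
```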

SirIzumi commented 5 years ago

  1. A model trained on a dataset whose LR images were made with nearest-neighbor downsampling produces very blurry results compared to the same dataset with LR made with bicubic. I thought this was due to the code implementation of the upscaling algorithms.

2-3. We will wait and be patient then! I don't think anyone else will write a readme/FAQ.

xiaozhi2015 commented 5 years ago

@SirIzumi Can you share the running speed in your "Practical use of ESRGAN in video game upscaling"? In my experiment, upscaling a 1024×576 image to 4096×2304 takes 3 s on a Tesla P40 GPU.