Hi, thanks for your interest in our work!
1) My first reaction to training from scratch: 2,500 LR/HR pairs is likely not enough data, and 260k iterations is not enough training time. Most of my experiments were trained on ~1.3 million LR/HR pairs for over one million iterations. I did see artifacts similar to the results you are showing in the first half of training.
2) Could you try this finetuning experiment without CLIPLoss? The net_g and net_d weights I emailed you did not use CLIPLoss. It would also be interesting to see loss plots if you have them. I successfully finetuned the S2NAIP model on WorldStrat and got good results, so transferring to a different high-res imagery source should hopefully work after some hyperparameter tuning. (One way to disable the CLIP loss is sketched below.)
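For anyone following along, here is a minimal sketch of stripping the CLIP loss out of a BasicSR-style options file before launching finetuning. The key name `clip_opt` and both file paths are assumptions, not confirmed names from this repo; check your own options YAML for the actual entry.

```python
# Hypothetical sketch: remove the CLIP loss entry from a BasicSR-style
# options file so finetuning runs with only the remaining losses.
# "clip_opt" and both file paths are assumptions -- adjust to your config.
import yaml

with open("options/finetune_s2naip.yml") as f:
    opt = yaml.safe_load(f)

# Drop the CLIP loss if present; pixel/perceptual/GAN losses stay untouched.
opt.get("train", {}).pop("clip_opt", None)

with open("options/finetune_s2naip_no_clip.yml", "w") as f:
    yaml.safe_dump(opt, f)
```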
The baseline ESRGAN model with 8 input Sentinel-2 images, a batch size of 8, and one A6000 GPU was trained from scratch in ~12 days. One could train on multiple GPUs to speed this up.
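As a rough illustration of the multi-GPU route, here is a generic PyTorch DDP skeleton. This is not the repo's actual launcher (BasicSR-based codebases typically support a `--launcher pytorch` flag used with `torchrun` instead), and the model is a stand-in, not ESRGAN.

```python
# Generic DDP skeleton (stand-in model, placeholder loop) -- not the repo's
# launcher. Run with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Stand-in for the generator: 8 RGB S2 images stacked -> 24 channels.
    model = torch.nn.Conv2d(24, 3, kernel_size=3, padding=1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... real training loop over LR/HR pairs goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```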
I'm having a similar issue.
@duyquang392 Hi, could you provide more information?
I'm building a dataset where I take 1 Sentinel image per location. My question is: can I build the LR input from that single image using my method, or is it necessary to use 8 images captured at different times (at the same location)? And why do you use different time ranges for the data?
You can use a model that was trained on only 1 image, but I saw the best balance between performance and efficiency when the model took in 8 images. With my 1S2 model, I saw more hallucinations and blurrier generated images.
In general, the idea behind using multiple images from the same location at different times is that satellite imagery of one location will have sub-pixel differences between captures, and by providing multiple different images to the model, it has the opportunity to piece together different information into the final output image.
If you only have 1 Sentinel image in your dataset, you should try 1) the 1S2 model with that image and 2) the 8S2 model with that image repeated 8 times (see the sketch below). I am not sure which would generate better imagery.
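A tiny sketch of option 2). The (T, C, H, W) layout is my assumption, so adapt it to however your dataloader actually stacks the 8 images (some pipelines concatenate along channels instead).

```python
# Sketch: fake an 8-image time series from a single Sentinel-2 chip by
# repetition. The (T, C, H, W) layout is an assumption; some pipelines
# instead concatenate along channels, giving (T*C, H, W).
import numpy as np

def repeat_single_image(chip: np.ndarray, n_frames: int = 8) -> np.ndarray:
    """chip: (C, H, W) single image -> (n_frames, C, H, W) stack."""
    return np.repeat(chip[np.newaxis, ...], n_frames, axis=0)

chip = np.zeros((3, 32, 32), dtype=np.float32)  # placeholder LR chip
stack = repeat_single_image(chip)
print(stack.shape)  # (8, 3, 32, 32)
```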
For the 8S2 model, the 8 images were chosen within 1.5 months of the target high-res image, and I tried to filter out images that were overly cloudy or partially invalid (not fully captured). There are seasonal changes between images chosen this way, but we found very few large, structural changes (like buildings being built). One could further reduce the time range to within 1 month if worried about too much change occurring.
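For reference, a hedged sketch of that selection logic. `scenes`, `cloud_frac`, and `invalid_frac` are hypothetical fields you would compute yourself (e.g. from the Sentinel-2 SCL band); they are not something this repo provides.

```python
# Hypothetical scene-selection sketch: keep up to 8 Sentinel-2 scenes within
# +/- 1.5 months (~45 days) of the target high-res image, dropping overly
# cloudy or partially invalid captures. All field names are assumptions.
from datetime import datetime

def select_scenes(scenes, target_date, window_days=45,
                  max_cloud=0.10, max_invalid=0.05, n_images=8):
    candidates = [
        s for s in scenes
        if abs((s["date"] - target_date).days) <= window_days
        and s["cloud_frac"] <= max_cloud
        and s["invalid_frac"] <= max_invalid
    ]
    # Prefer captures closest in time to the target image.
    candidates.sort(key=lambda s: abs((s["date"] - target_date).days))
    return candidates[:n_images]

# Example: select_scenes(scene_list, datetime(2020, 7, 1))
```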
Hi team members and everyone, first of all, thanks for this interesting repo. I ran into some problems and have a few questions, detailed below:
Why does training from scratch produce such poor results?
Why does training with pre-trained weights seem to get worse the more you train?
Could you give me some recommendations on how to address these problems?
And the last question: what is your hardware configuration for successfully training the S2-NAIP model, and how long does training take?
Thank you all for your support.