diegovalsesia / deepsum

DeepSUM: Deep neural network for Super-resolution of Unregistered Multitemporal images (ESA PROBA-V challenge)

How to pre-train the RegNet #3

Open lionlai1989 opened 4 years ago

lionlai1989 commented 4 years ago

Now I want to pre-train the RegNet with Sentinel-2 and SPOT images, but I don't know how to pre-train the RegNet without knowing the ground truth.

In your paper, it says:

The input data to be used for the pretraining of RegNet are the feature maps produced by the pretrained SISRNet for the images in the training set.

The next sentences read:

As described in Sec. IV-B, the input to RegNet are N feature maps from images of the same scene. These feature maps are then synthetically shifted with respect to the first one by a random integer amount of pixels. The purpose is to create a balanced dataset where all possible K^2 classes (shifts) are seen by the network. The desired output is a filter with all zeros except for a one in the position corresponding to the chosen shift.

I don't really understand this part. Where does the random integer amount of pixels come from? If it's random, does that mean the ground truth is a random vector with one component equal to 1 and the rest 0?

My question is: what is the ground-truth data when pretraining the RegNet?

Thank you.

shyu4184 commented 4 years ago

@lionlai1989 Hello, did you solve this problem? I'm implementing the RegNet using PyTorch and my own dataset. In my case, I translated the feature maps by a randomly selected integer translation and generated the ground-truth filters as described in the paper.
However, the filters I obtain are quite different from the ground-truth ones.

Would you share your approach?

lionlai1989 commented 4 years ago

Hello @shyu4184 Right now I am using the author's TensorFlow code with my Sentinel-2 (LR) and SPOT (HR) dataset. I am trying to avoid modifying the network code itself and instead just feed my dataset into the training and testing process.

  1. In network.py, there are four lines of code whose inputs I don't know where to get. All I have is the input (LR) and the ground truth (HR). For me, the mask is not necessary because none of my images have clouds, but I don't know what the other three inputs are.

    self.mask_y = tf.placeholder('float32', shape=[None, 1, None, None, 1], name='mask_y')
    self.y_filters = tf.placeholder('float32', shape=[None, None, self.dyn_filter_size**2], name='y_filters')
    self.fill_coeff = tf.placeholder(tf.float32, shape=[None, self.T_in, self.T_in, None, None, 1], name='fill_coeff')
    self.norm_baseline = tf.placeholder('float32', shape=[None, 1], name='norm_baseline')

  2. Back to your question: you said you translated the feature maps by a randomly selected integer translation. I think the feature maps here are the output of SISRNet, right? What I don't understand is why we can shift the feature maps with respect to the first one by a random integer amount of pixels. This shift becomes the ground truth when pretraining the network, so how can we choose it randomly?

  3. I also don't understand how the registration shift relates to the one-hot softmax output. The shift values are 1, 2, 3, ..., so how are they turned into the K*K classes (the filter size)?

Sorry for bringing up more questions; any help in understanding the implementation would be appreciated.

shyu4184 commented 4 years ago

@lionlai1989 Since I have only tried to implement the RegNet, I can't comment on the usage of the four variables in your first question. Yes, you're right: the feature maps are the output of SISRNet, and the first one is the reference for the rest. Since the role of the RegNet is to align the misaligned super-resolved feature maps, we need to translate the feature maps ourselves to train it. PyTorch has torch.roll to shift a feature map; I believe the TensorFlow equivalent is tf.roll.
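
Roughly, my shifting step looks like the sketch below (plain NumPy; the function name, the max_shift range, and the (H, W, C) layout are my own assumptions, and torch.roll/tf.roll do the same thing on tensors):

    import numpy as np

    def random_shift(feature_map, max_shift=4, rng=None):
        # Circularly shift an (H, W, C) feature map by a random integer
        # offset along each spatial axis; the sampled (dy, dx) offset
        # becomes the label for RegNet pretraining.
        rng = rng or np.random.default_rng()
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(feature_map, shift=(int(dy), int(dx)), axis=(0, 1))
        return shifted, (int(dy), int(dx))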

I was also confused about how they generate the ground-truth filter. As mentioned in the paper, they synthesize it as a convolution filter whose value is 1 at the shifted pixel and 0 everywhere else, like a shifted delta filter. Although I followed their explanation, I'm not sure whether my implementation is correct.
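
In case it helps to compare, this is how I build the ground-truth filter (again only a sketch; K is the filter size, and whether the 1 goes at center + shift or center - shift depends on the convolution/correlation convention, which is exactly the kind of sign detail that could explain the mismatch I'm seeing):

    import numpy as np

    def ground_truth_filter(dy, dx, K):
        # K x K "shifted delta": a single 1 at the position corresponding
        # to the integer shift (dy, dx), zeros everywhere else.
        # The shift must fit inside the filter: |dy|, |dx| <= K // 2.
        assert abs(dy) <= K // 2 and abs(dx) <= K // 2
        f = np.zeros((K, K), dtype=np.float32)
        center = K // 2
        f[center + dy, center + dx] = 1.0
        return f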

lionlai1989 commented 4 years ago

@shyu4184 Thank you for your feedback. It feels good to have someone to talk to about this topic. My question remains: how do I generate the ground truth for the RegNet when pretraining?

For example, if an image's shift is (x=1, y=2) pixels (the paper says it has to be an integer) with respect to the reference image, what should the ground-truth vector look like? And how is it related to the K*K filter size? That is the part that really confuses me.

shyu4184 commented 4 years ago

@lionlai1989

Sorry for the late reply. The authors say they generate a filter whose value is 1 at the shifted location and 0 everywhere else. That is, if you shift by (x=1, y=2), the ground-truth filter has the value 1 at the location (1, 2) relative to the filter center and zeros elsewhere, like a shifted delta filter. As far as I understand, the ground-truth and predicted filters are treated as 2-dimensional convolution filters, and if the filter size is K*K, there are K^2 possible translations (classes).
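
To connect this with the K^2 classes and the y_filters placeholder shape [..., self.dyn_filter_size**2] from the TensorFlow code above: as I understand it, the K x K filter is simply flattened row-major into a one-hot vector of length K^2, so each integer shift is one class. A sketch (K = 5 is only an illustrative value, not something I checked against the repo):

    import numpy as np

    def shift_to_one_hot(dy, dx, K):
        # Map an integer shift (dy, dx) to a one-hot vector of length K**2:
        # the class index is the row-major position of the shifted delta
        # inside the K x K filter grid.
        center = K // 2
        cls = (center + dy) * K + (center + dx)
        one_hot = np.zeros(K * K, dtype=np.float32)
        one_hot[cls] = 1.0
        return one_hot

    # e.g. a shift of (x=1, y=2) with K = 5 lands in class (2+2)*5 + (2+1) = 23
    print(shift_to_one_hot(dy=2, dx=1, K=5).argmax())  # -> 23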