RPM-Robotics-Lab / sRGB-TIR

Repository for synthetic RGB to Thermal Infrared translation module from "Edge-guided multidomain RGB to TIR translation", ICRA 2023 submission
MIT License

A confusion about the loss function #2

Closed wuaodi closed 1 year ago

wuaodi commented 1 year ago

Hi, thank you for your significant and interesting work!

I have two questions about the loss function:

  1. $$\begin{aligned} \mathcal{L}_{Lap} &= \mathbb{E}\left[\left|L\left(x_{TIR}\right)-L\left(x_{TIR,\text{recon}}\right)\right|_1\right] \\ L\left(x_{TIR}\right) &= \frac{1}{3}\left(L\left(x_{TIR}^1\right)+L\left(x_{TIR}^2\right)+L\left(x_{TIR}^3\right)\right) \end{aligned}$$

This loss is the LoG loss, which constrains the edge similarity between the input RGB image and the generated TIR image. However, I don't understand why $L\left(x_{TIR}\right)$ is averaged over three channels in the formula above; in my view, $x_{TIR}$ only has one channel.

  2. The loss weighting coefficients were set to 20, 10, 10, 20, and 5, respectively. How did you determine these coefficients? Did you try other values? I ran a very similar experiment and found that different coefficients give different results, sometimes better and sometimes worse.

I am very much looking forward to your reply! Thank you again for this meaningful work.

rpmsnu commented 1 year ago

Hello Wuaodi! Thanks for taking an interest in our work, and sorry for the late reply. It is project proposal season here in Korea, so things are a bit busy atm!

I'd be happy to answer your inquiries. Please see below.

  1. You are right that thermal infrared images are usually read as single-channel images. However, in our network we expand the single-channel TIR image into a 3-channel TIR image for two reasons. First, the network architecture is much easier to implement and to adapt to tasks other than TIR translation (using 3-channel TIR images is much easier than restructuring the entire architecture to fit one channel). Second, we wanted to leverage transfer learning from ImageNet-pretrained weights to some extent: although ImageNet is RGB, its learned features are still a reasonably good starting point when the amount of TIR data is limited. Since ImageNet-pretrained backbones take 3-channel inputs, we used 3 channels to suit that need (see the sketch after this list).

  2. The short answer is empirical fine-tuning, guided by some expectation of which loss the network should prioritize as it learns. When training the architecture, what we valued most was the network's ability to reconstruct the original TIR image, as well as its ability to reconstruct the given edges from the translated TIR images. With this goal in mind, we set the weights for the image reconstruction and LoG losses higher than the others; the remaining ratios were all set to 10. Empirically, if we set any of the remaining losses higher, the network did not converge as well. Regarding the "similar experiment" you mentioned, I'm not sure which losses you used, but I did find in practice that if you let the RGB side (i.e., the TIR-to-RGB translation side) train for too long, the model tends to collapse on the reconstruction loss. That is something you may want to keep in mind when choosing your loss coefficients.
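
For anyone reading along, here is a minimal PyTorch-style sketch of how the 3-channel expansion and the per-channel LoG loss from the equation above could be wired together. This is not the released code; the LoG kernel size, normalization, and helper names (`expand_tir`, `log_per_channel`, `lap_loss`) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# 5x5 Laplacian-of-Gaussian approximation; the paper may use a different
# kernel size or sigma (assumption for illustration).
LOG_KERNEL = torch.tensor([
    [ 0.,  0., -1.,  0.,  0.],
    [ 0., -1., -2., -1.,  0.],
    [-1., -2., 16., -2., -1.],
    [ 0., -1., -2., -1.,  0.],
    [ 0.,  0., -1.,  0.,  0.],
]).view(1, 1, 5, 5)

def expand_tir(x_tir_1ch):
    """Replicate a single-channel TIR image to 3 channels (B x 1 x H x W -> B x 3 x H x W)."""
    return x_tir_1ch.repeat(1, 3, 1, 1)

def log_per_channel(img):
    """Apply the LoG filter to each channel independently (img: B x C x H x W)."""
    c = img.shape[1]
    kernel = LOG_KERNEL.to(img.device, img.dtype).repeat(c, 1, 1, 1)
    return F.conv2d(img, kernel, padding=2, groups=c)

def lap_loss(x_tir, x_tir_recon):
    """L_Lap = E[ | L(x_TIR) - L(x_TIR,recon) |_1 ], with L averaged over the 3 channels."""
    log_x = log_per_channel(x_tir).mean(dim=1)            # (1/3) * sum over channels
    log_recon = log_per_channel(x_tir_recon).mean(dim=1)
    return (log_x - log_recon).abs().mean()

# usage sketch: loss = lap_loss(expand_tir(tir_1ch), tir_recon_3ch)
```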

I hope our reply answers your questions. If you have any further questions or things you'd like to discuss, feel free to reply to this post or send us an email! We'd be happy to chat :)

wuaodi commented 1 year ago

Thank you for your reply! I got a lot of inspiration from your answer.

But I still have a question about the following loss:

$$\mathcal{L}_{Lap} = \mathbb{E}\left[\left|L\left(x_{TIR}\right)-L\left(x_{TIR,\text{recon}}\right)\right|_1\right]$$

What I want to know is where $x_{TIR,\text{recon}}$ comes from. I think this loss constrains the edge similarity between the input RGB image and the generated TIR image, just like Fig. 4 in your paper, but maybe I misunderstood.

So could you explain this formula further?

Good luck with your job and project proposal!

rpmsnu commented 1 year ago

Ah, I see where your confusion is! Figure 4 and the LoG loss used for training are completely different!

TL;DR: Figure 4 and the LoG loss are completely separate procedures. The LoG loss is applied during training; Figure 4 is the procedure that happens after training (during inference!).

Here is the detailed explanation:

  1. LoG used during training: The equation is correct. When we reconstruct a TIR image, we first decompose the image into content and style. Afterwards, using the same content and style we just decomposed, we reconstruct the original TIR image, which is denoted $x_{TIR,\text{recon}}$. Essentially, you compute the LoG loss between $x_{TIR,\text{recon}}$ and $x_{TIR}$. The same goes for RGB (you compute the LoG loss between $x_{RGB,\text{recon}}$ and $x_{RGB}$).

In essence, by doing this, regardless of which modality you translate, we impose an edge-guided decoding operation on each decoder for any content and style vector from its domain.

  2. Figure 4 in the paper refers to the style selection procedure, which happens after you translate the images! When you translate a set of RGB images into TIR, unless you use example-guided translation, you usually sample a number of latent codes z (from a normal distribution) and use them as style vectors (in essence, the network is trained to cope with varying latent codes z). So in practice, you get N different translated images for N different latent codes. But how do you decide which one is most suitable for the given task, or which latent code gives the best quality across tasks? (Obviously, we can't just pick the ones we think look best.) To devise a reasonable and fair procedure, we proposed the "style selection procedure". Given an input RGB image you want to translate, you first translate it with N different latent codes z, yielding N different translated TIR images. Afterwards, you apply the LoG filter to the RGB image and to each of the N translated TIR images. Finally, by comparing the SSIM (or PSNR) between the extracted edges of the RGB image and those of each style's TIR image, you select which style (or latent code) should be used for whatever downstream task you want to train (a sketch follows below).
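
To make the procedure above concrete, here is a hedged sketch (not the authors' released code) of the style selection loop: sample N latent codes, translate, and keep the style whose LoG edges best match the input RGB's edges under SSIM. The helpers `translate_rgb_to_tir` and `log_edges` are hypothetical stand-ins for the trained decoder and a LoG filter like the one sketched earlier, and the defaults (`n_styles`, `z_dim`) are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity

def select_style(x_rgb, translate_rgb_to_tir, log_edges, n_styles=16, z_dim=8, seed=0):
    """Pick the latent code whose translated TIR has LoG edges closest to the RGB's edges.

    x_rgb: input RGB image; translate_rgb_to_tir(x_rgb, z) -> translated TIR image;
    log_edges(img) -> 2D edge map (both helpers are hypothetical stand-ins).
    """
    rng = np.random.default_rng(seed)
    rgb_edges = log_edges(x_rgb)                        # edges of the input RGB
    best = {"score": -np.inf, "z": None, "tir": None}
    for _ in range(n_styles):
        z = rng.standard_normal(z_dim)                  # style vector z ~ N(0, I)
        x_tir = translate_rgb_to_tir(x_rgb, z)          # one translated TIR per sampled style
        tir_edges = log_edges(x_tir)
        score = structural_similarity(
            rgb_edges, tir_edges,
            data_range=float(rgb_edges.max() - rgb_edges.min()),
        )
        if score > best["score"]:                       # keep the most edge-consistent style
            best = {"score": score, "z": z, "tir": x_tir}
    return best
```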

I hope this answers your questions! If you are still confused, please let me know :)

wuaodi commented 1 year ago

I think I have fully understood your ideas and loss function composition, thank you for your detailed and patient answer!