Closed wuaodi closed 1 year ago
Hello Wuaodi! Thanks for taking an interest in our work, and sorry for the late reply! It is project proposal season here in Korea, so things are a bit busy atm!
I'd be happy to answer your inquiries. Please see below.
You are right that thermal infrared images are usually read as single-channel images. However, in our network we expand the single-channel TIR image into a 3-channel TIR image, for two reasons: 1. The network architecture is much easier to implement and to adapt to tasks other than TIR imagery (using 3-channel TIR images is far simpler than reworking the entire network architecture to fit a single channel). 2. We wanted to leverage transfer learning from ImageNet-pretrained weights to some extent (although ImageNet pretraining is on RGB, its learned features are still a reasonably good starting point given the limited amount of TIR data). Since ImageNet-pretrained backbones take 3-channel inputs, I used 3 channels to suit that need.
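For example, expanding a single-channel TIR image to three channels can be as simple as repeating it along the channel axis (a minimal numpy sketch; the variable names are illustrative, not from the actual codebase):

```python
import numpy as np

# Hypothetical single-channel TIR image of shape (H, W).
tir = np.random.rand(256, 256).astype(np.float32)

# Repeat the single channel three times so the input matches the
# (3, H, W) shape expected by ImageNet-pretrained backbones.
tir_3ch = np.repeat(tir[np.newaxis, :, :], 3, axis=0)

print(tir_3ch.shape)  # (3, 256, 256)
```

All three channels are identical copies, so no information is added; the expansion only makes the tensor shape compatible with pretrained 3-channel backbones.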
The short answer is empirical fine-tuning, combined with some predictions about what each loss would learn and which loss should be prioritized over time. When training the architecture, the aspects we valued most were the network's ability to reconstruct the original TIR image and its ability to reconstruct the given edges from the translated TIR images. With that goal in mind, we set the weights for the image reconstruction + LoG losses higher than the others; apart from those, we set all of the ratios to 10. Empirically, if we set any of the remaining losses higher, the network didn't converge well. Regarding the "similar experiment" you mentioned, I'm not sure which losses you used. However, I did find in practice that if you let the RGB side (i.e., the TIR-to-RGB translation side) train for too long, the model tends to collapse on the reconstruction-loss side. That's something you may want to keep in mind when choosing the coefficients for your losses.
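As a rough sketch of what such a weighted combination might look like in code (the loss names and weight values below are placeholders for illustration, not the exact configuration from the paper):

```python
# Illustrative weights: reconstruction + LoG are prioritized over the
# remaining terms, following the authors' description. Actual values
# should come from the paper's training configuration.
weights = {
    "image_recon": 10.0,  # reconstruct the original TIR image
    "log": 10.0,          # LoG (edge) loss on the reconstructed TIR
    "adv": 1.0,           # kept lower: raising these hurt convergence
    "cycle": 1.0,
}

# Placeholder per-term loss values standing in for real tensors.
losses = {"image_recon": 0.5, "log": 0.3, "adv": 0.8, "cycle": 0.6}

total = sum(weights[k] * losses[k] for k in losses)
print(total)
```

The design point is simply that the terms you care most about (reconstruction and edges, here) dominate the gradient signal, while the rest stay small enough not to destabilize training.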
I hope our reply answers your questions! If you have any further questions or things you'd like to discuss, feel free to reply to the post or send us an email. We'd be happy to chat :)
Thank you for your reply! I got a lot of inspiration from your answer.
But I still have a question about the
$$
\mathcal{L}_{Lap} = \mathbb{E}\left[\left|L(x_{TIR}) - L(x_{TIR,\text{recon}})\right|_1\right]
$$
What I want to know is where $x_{TIR,\text{recon}}$ comes from. I think this loss is used to constrain the edge similarity between the input RGB image and the generated TIR image, just like Fig. 4 in your paper, but maybe I misunderstood.
So could you explain this formula further?
Good luck with your job and project proposal!
Ah, I see where your confusion is! Figure 4 and the LoG loss used for training are completely different!
TL;DR: Figure 4 and the LoG loss are completely separate procedures. The LoG loss is applied during training; Figure 4 shows the procedure that happens after training (during inference!).
Here is the detailed explanation:
In essence, by doing this we impose an edge-guided decoding operation on each decoder, given any content and style vector from the corresponding domain, regardless of which modality you translate.
I hope this answers your questions! If you are still confused, please let me know :)
I think I have fully understood your ideas and loss function composition, thank you for your detailed and patient answer!
Hi, thank you for your significant and interesting work!
I have two questions about the loss function:
$$
\begin{aligned}
\mathcal{L}_{Lap} &= \mathbb{E}\left[\left|L(x_{TIR}) - L(x_{TIR,\text{recon}})\right|_1\right] \\
L(x_{TIR}) &= \frac{1}{3}\left(L(x_{TIR}^1) + L(x_{TIR}^2) + L(x_{TIR}^3)\right)
\end{aligned}
$$
This loss is the LoG loss, which constrains the edge similarity between the input RGB image and the generated TIR image. However, I don't understand why it is $L(x_{TIR})$ in the formula above; in my view, $L(x_{TIR})$ only has one channel.
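To make my question concrete, here is a small numpy sketch of the channel-averaged operator in the second equation (using a plain discrete Laplacian as a stand-in for the actual LoG filter; all names are illustrative):

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian, a simple stand-in for LoG."""
    out = np.zeros_like(img)
    out[1:-1, 1:-1] = (
        img[:-2, 1:-1] + img[2:, 1:-1] +
        img[1:-1, :-2] + img[1:-1, 2:] -
        4.0 * img[1:-1, 1:-1]
    )
    return out

# 3-channel TIR image made by repeating a single channel, so each
# channel's Laplacian is identical.
tir = np.repeat(np.random.rand(1, 64, 64), 3, axis=0)

# L(x_TIR) = (1/3) * (L(x^1) + L(x^2) + L(x^3))
l_avg = sum(laplacian(tir[c]) for c in range(3)) / 3.0

print(np.allclose(l_avg, laplacian(tir[0])))  # True
```

If the three channels are identical copies of the single TIR channel, the average equals the Laplacian of any one channel, which is why the averaging looks redundant to me.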
I am very much looking forward to your reply! Thank you again for this meaningful work.