Open · HaoDot opened this issue 2 years ago
Hi Ginobili-20,
Thanks for your interest in our work.
Sorry for the delayed reply, and thanks for your explanation! However, one question remains; I hope you can reply again.

There are three loss functions in EVDI: the Blurry-event (B-E) loss, the Blurry-sharp (B-S) loss, and the Sharp-event (S-E) loss. From Table 3, it seems that the B-S loss plays the most important role in supervision, while the others alone cannot even help the model converge. Everything above is from the original paper. But the B-S loss can only take effect in the deblurring task, i.e., recovering the latent frames within the exposure time. So the B-S loss cannot help the model converge on the interpolation task, which is designed to recover latent frames outside the exposure time. What's more, the B-E loss has a trivial solution, when E(f, T_i) is equal to B_i and L(f) is equal to 1. And the S-E loss is affected by noise in the event streams. As shown in Table 3, none of these losses works well on its own, so chances are the model cannot converge well. However, EVDI still performs strongly on the interpolation task. I cannot understand how fine-tuning with the B-E and S-E losses achieves such a good result, and I wonder if there are other training strategies I have missed.

To sum up, my remaining question is how EVDI is supervised in the interpolation task. Waiting for your reply, thanks!

P.S. EVDI is still a brilliant work, and it made a strong impression on me!
Thanks for your question.
Losses: As stated in Sec. 5.4 of our paper, the B-S (Blurry-sharp) loss contributes to brightness consistency, while the B-E and S-E losses are designed to handle motion ambiguity. They all play important roles in EVDI, since we aim to recover sharp results (related to motion ambiguity) with correct brightness (related to brightness consistency). Although the B-S loss appears to achieve the best quantitative results in Tab. 3 compared with the B-E and S-E losses, that is because the metrics depend heavily on pixel brightness, so they cannot tell the whole story. For instance, the qualitative results in Fig. 6 show that the B-S loss ensures correct brightness but cannot produce results as sharp as the B-E loss. In fact, models with the B-E and S-E losses also converge, as shown in the figure below, where the x-axis and y-axis indicate the training epoch and the normalized loss value, respectively.

Regarding the trivial solution of the B-E loss: E(f, T_i) = B_i could occur if the LDI networks took blurry frames as inputs and learned an identity mapping. But in our case, for a fixed blurry frame B_i, a different chosen timestamp t leads to different input events to the LDI and thus a different E(f, T_i), which potentially avoids the trivial solution.
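To make the constraint concrete, here is a minimal PyTorch sketch of a blurry-event style loss under the relation B_i ≈ L(f) · E(f, T_i) implied above. The tensor names and the exact loss form are illustrative assumptions, not the repository's actual implementation (see codes/Loss.py for that):

```python
import torch

def blurry_event_loss(latent, ldi_out, blurry):
    """Sketch of a B-E style constraint: the predicted latent frame L(f),
    modulated by the LDI estimate of E(f, T_i), should reproduce the
    observed blurry frame B_i. Note the trivial solution discussed above:
    latent == 1 with ldi_out == blurry would also zero this loss, which
    is avoided by feeding the LDI events sliced relative to a varying
    timestamp rather than the blurry frames themselves."""
    return torch.mean(torch.abs(latent * ldi_out - blurry))

# Toy usage with random (B, C, H, W) tensors:
latent = torch.rand(1, 1, 64, 64)   # predicted sharp frame L(f)
ldi_out = torch.rand(1, 1, 64, 64)  # LDI estimate of E(f, T_i)
blurry = torch.rand(1, 1, 64, 64)   # observed blurry frame B_i
print(blurry_event_loss(latent, ldi_out, blurry).item())
```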
Interpolation: As discussed above, motion ambiguity can be handled in both interpolation and deblurring (via the B-E and S-E losses). For brightness consistency, we train interpolation together with deblurring and use the same EVDI model to fulfill both tasks, so the brightness-consistency constraint also holds for the interpolated frames.
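How one model might be supervised on both tasks can be illustrated with a hypothetical timestamp sampler; the function name, probability, and interval layout below are assumptions for illustration, and EVDI's actual training code may organize this differently:

```python
import random

def sample_latent_timestamp(exp1, exp2, p_deblur=0.5):
    """Hypothetical joint-training sampler. exp1 = (t1_start, t1_end) and
    exp2 = (t2_start, t2_end) are the exposure windows of the two blurry
    frames, with exp1 ending before exp2 starts."""
    if random.random() < p_deblur:
        # Deblurring target: f inside one of the exposure windows.
        lo, hi = random.choice([exp1, exp2])
    else:
        # Interpolation target: f in the gap between the two exposures.
        lo, hi = exp1[1], exp2[0]
    return random.uniform(lo, hi)

# Example: exposures [0.0, 0.4] and [0.6, 1.0] on a normalized timeline.
f = sample_latent_timestamp((0.0, 0.4), (0.6, 1.0))
print(f)
```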
Admittedly, EVDI is not perfect and has some limitations, such as the noise issue in the S-E loss, but we hope EVDI can inspire more exciting work in this field. Thanks.
Thanks for replying in detail again. Now I understand `num_leftB` and `num_rightB`: combining your explanation and the code below, I know how to select the recovered outputs within the exposure time to synthesize the blurry frames.
https://github.com/XiangZ-0/EVDI/blob/a9a22ce4f671aa158bb8d2c6bbcb4325c07016e6/codes/Loss.py#L14
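As a companion to the linked code, here is a simplified sketch of that selection-and-averaging idea. The variable names are illustrative and not taken from Loss.py, and it assumes the recovered latent frames and their timestamps are already available:

```python
import torch

def synthesize_blurry(latents, timestamps, t_start, t_end):
    """Average the recovered latent frames whose timestamps fall inside
    one blurry frame's exposure window [t_start, t_end], approximating
    the physical blur formation B = (1/T) * integral of L(t) dt.
    Assumes at least one timestamp lies inside the window."""
    inside = [L for L, t in zip(latents, timestamps) if t_start <= t <= t_end]
    return torch.stack(inside).mean(dim=0)
```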
Finally, thanks again for your detailed answers!
Hi, nice work on event-based video unfolding! You have considered a practical setting where the duty cycle is not 1. However, I still have a few questions.
1. The number of output images
It seems that the final conv layer in EVDI has a single output channel, which means it synthesizes one image at a time; your paper confirms this too. However, the blurry-sharp loss in EVDI needs M reconstructions, and M has to be large enough. I don't understand how to reconcile these two facts.
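Based on the exchange above (recovering multiple latent frames and averaging those inside the exposure window), one plausible reading is that the single-output network is simply run M times with different target timestamps. A hedged sketch, with `model(events, blurry, t)` as a hypothetical stand-in interface rather than EVDI's actual one:

```python
import torch

def reconstruct_for_bs_loss(model, events, blurry, timestamps):
    """Run a one-image-per-call network once per timestamp (all inside
    the exposure window) and average the M outputs to re-synthesize the
    blurry frame for the blurry-sharp loss."""
    latents = [model(events, blurry, t) for t in timestamps]
    return torch.stack(latents).mean(dim=0)
```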
2. The setting of Sections 5.2 and 5.3
EVDI is designed to fulfill motion deblurring and interpolation at the same time, but you use it in different settings. My understanding is as follows: for the deblurring in Section 5.2, the timestamp f of the latent frame to recover lies inside the exposure time T_1 or T_2; for the interpolation in Section 5.3, EVDI tries to recover the intermediate frames between the exposure times. I don't know whether my understanding is right. Hope you can help me with the issues above. Thanks a lot.