XiangZ-0 / EVDI

Implementation of CVPR'22 paper "Unifying Motion Deblurring and Frame Interpolation with Events"

Questions about the data process #5

Open SecondHupuJR opened 2 years ago

SecondHupuJR commented 2 years ago

Congrats on the great work! I have a couple of questions. At line 77 of util.py, new_t is divided by interval, which is (total_end - total_start)/num_frame, and num_frame here equals num_bins (set to 16) according to the code. https://github.com/XiangZ-0/EVDI/blob/a9a22ce4f671aa158bb8d2c6bbcb4325c07016e6/codes/util.py#L77 If my understanding is right, some channels of the (N, 2C, H, W) event tensor will contain only zeros, because the last event timestamp ts is always smaller than total_end. Is that the case?
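To make the question concrete, here is a rough sketch of how I understand the binning (variable names are mine, not from util.py):

```python
import numpy as np

def events_to_tensor(event_t, event_x, event_y, event_p,
                     total_start, total_end, num_bins, H, W):
    """Accumulate events into a (2*num_bins, H, W) tensor using a fixed interval."""
    tensor = np.zeros((2 * num_bins, H, W), dtype=np.float32)
    interval = (total_end - total_start) / num_bins   # fixed temporal resolution
    for t, x, y, p in zip(event_t, event_x, event_y, event_p):
        b = min(int((t - total_start) / interval), num_bins - 1)  # bin index
        c = 2 * b + (1 if p > 0 else 0)               # polarity channel inside the bin
        tensor[c, y, x] += 1.0
    # If the last event timestamp is well below total_end, the trailing bins
    # (and hence their channels) stay all-zero -- this is what I am asking about.
    return tensor
```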

Besides, the event tensor fed to the model will have shape (batchsize×49, 16×2, H, W), which is quite big. Is this the reason that the GoPro experiments are conducted at 160x320? Thank you for the good work!

XiangZ-0 commented 2 years ago

Hello SecondHupuJR, thanks for your interest in our work. You are right about the event tensors: each of them is constructed with the fixed interval (total_end - total_start)/num_frame and can therefore contain all-zero channels. This design ensures the same temporal resolution across different event tensors, which is important since we use weight-sharing LDI networks. It also allows an arbitrary choice of the target timestamp without changing the input shape of the event tensor.

For the GoPro experiments, it is feasible to work with the full image size of the original REDS dataset: even though the event tensor is big, we can crop the input data to 256x256 or 128x128 patches for training. We use 160x320 simply for efficient training and testing, because the amount of simulated event data at full image size is quite large.
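As a rough illustration (not our actual data loader), cropping just slices the same spatial window from the event tensor and the blurry image, so the network never sees the full-size tensor during training:

```python
import numpy as np

def random_crop(event_tensor, blurry_img, crop_h=128, crop_w=128):
    """Crop the same window from an event tensor (C, H, W) and an image (H, W, 3)."""
    _, H, W = event_tensor.shape
    y0 = np.random.randint(0, H - crop_h + 1)
    x0 = np.random.randint(0, W - crop_w + 1)
    return (event_tensor[:, y0:y0 + crop_h, x0:x0 + crop_w],
            blurry_img[y0:y0 + crop_h, x0:x0 + crop_w])
```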

SecondHupuJR commented 2 years ago

Hi XiangZ-0, many thanks for the reply!

Regarding my second question, what I meant is that the event tensor's shape is (batchsize×49, 16×2, H, W), which is quite large. For the images in GoPro (1280x720), is it possible to run inference on the full image? In that case H and W would be 720 and 1280, respectively.

XiangZ-0 commented 2 years ago

Actually, event tensors with the shape (batchsize×49, num_bins×2×4, H, W) are only needed for training; the number 49 means that we recover 49 latent frames simultaneously for computing the blur-sharp loss. During inference, it is fine to process 1280x720 images, since each LDI network only takes (batchsize (usually set to 1), num_bins×2, H, W) as input, so that is not a problem. Thanks for your question.
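For a rough sense of scale (just a back-of-the-envelope estimate assuming float32 inputs, not a measured number):

```python
# One LDI input at full GoPro resolution: (1, 2*num_bins, H, W) in float32.
num_bins, H, W = 16, 720, 1280
bytes_per_input = 1 * (2 * num_bins) * H * W * 4
print(bytes_per_input / 2**20)  # ~112.5 MiB, so full-resolution inference is manageable
```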

SecondHupuJR commented 2 years ago

Oh, right!

Thank you for the reply! This is really nice work. Besides, I'm wondering: if the model were first trained on synthetic data with ground truth in a supervised fashion, and then fine-tuned on real event data in an unsupervised fashion, would the result be better?

XiangZ-0 commented 2 years ago

Probably yes. We trained EVDI with ground-truth sharp images on the GoPro dataset a long time ago, and I remember that the supervised EVDI model surpassed its self-supervised counterpart by around 1-2 dB in PSNR, thanks to the strong supervision signals from the GT images. You are also welcome to validate this yourself :-)