hanquansanren closed this issue 1 year ago
Hi, in fact, this is not our contribution. In our new paper at ECCV 2022, we show that this module is borrowed from RAFT, which was designed for optical flow. Document image rectification can in fact be regarded as a monocular flow regression problem. For your Q4: in our experience, it is easier to estimate pixel-wise displacement than absolute coordinates.
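For readers following the Q4 point, here is a minimal NumPy sketch of how a predicted displacement field can be turned into an absolute backward mapping by adding a fixed coordinate grid. It is an illustration, not the DocTr code: the function names `coords_grid` and `backward_map_from_flow` and the `(2, H, W)` layout with x in channel 0 are assumptions made for this example.

```python
import numpy as np


def coords_grid(height, width):
    """Return a (2, H, W) grid; channel 0 holds x (column), channel 1 holds y (row)."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([xs, ys], axis=0).astype(np.float32)


def backward_map_from_flow(flow):
    """Convert a (2, H, W) per-pixel displacement into absolute sampling coordinates.

    The network only has to regress the offset; the identity grid is added
    once at the very end, so a zero prediction means "leave the image as is".
    """
    _, height, width = flow.shape
    return coords_grid(height, width) + flow


# A zero displacement reproduces the identity mapping.
identity = backward_map_from_flow(np.zeros((2, 4, 5), dtype=np.float32))

# Shifting every pixel by +1.5 in x moves only the x channel.
shift = np.zeros((2, 4, 5), dtype=np.float32)
shift[0] = 1.5
shifted = backward_map_from_flow(shift)
```

Under this reading, the added grid is not a learned positional encoding. It is a constant offset that converts relative displacements into absolute coordinates, which is why it can be applied as the last step.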
Thanks a lot for your response. I will look into RAFT for more details. (づ ̄3 ̄)づ╭❤~
Hello Hao, I have read your pioneering work on using ViT for document unwarping. The design of the geometry tail in the paper is marvellous: it proposes a learnable module that upsamples the decoded features $f_d$. As shown in the following figure: [figure]. I have four questions about this part.
Q1: As I understand it, this design is essentially a local dot product of two feature maps (I mean $f_o$ and $f_m$). Is my understanding correct? I would not have thought of this design myself, so I wonder what motivated it. Is there a similar design in other papers?
Q2: In the following code, why does `flow` need to be multiplied by 8? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L211
Q3: In the following code, why is softmax applied to `mask`? Does the softmax operation have some special significance here? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L209
Q4: In the following code, why should `coodslar` be added to the predicted backward mapping? Is this operation important? My guess is that it acts as a kind of positional encoding. But this is the final operation in the network, so why not add this positional encoding in an earlier layer? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L231
Many thanks for your explanation.
Best wishes, Weiguang Zhang
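For context on Q2 and Q3, below is a hedged NumPy sketch of RAFT-style convex upsampling, the technique the reply says this module is borrowed from. It is a single-sample re-implementation written for illustration, not the PyTorch code in `GeoTr.py`. The function name `convex_upsample` and the `(9, factor, factor, H, W)` mask layout are assumptions. The sketch suggests one plausible reading of both questions: the softmax turns the 9 mask values into convex-combination weights over a 3x3 neighbourhood, and the factor of 8 rescales displacements measured in coarse pixels into full-resolution pixels.

```python
import numpy as np


def convex_upsample(flow, mask_logits, factor=8):
    """Upsample a coarse (2, H, W) displacement field to (2, factor*H, factor*W).

    mask_logits has shape (9, factor, factor, H, W): for every fine pixel
    inside a coarse cell, 9 unnormalised weights over the cell's 3x3 neighbours.
    """
    _, height, width = flow.shape

    # Softmax over the 9 neighbours: weights are positive and sum to 1,
    # so each fine value is a convex combination of coarse values.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # A displacement of 1 coarse pixel equals `factor` fine pixels.
    scaled = factor * flow

    # Collect each cell's 3x3 neighbourhood (zero padded): (2, 9, H, W).
    padded = np.pad(scaled, ((0, 0), (1, 1), (1, 1)))
    neighbours = np.stack(
        [padded[:, dy:dy + height, dx:dx + width]
         for dy in range(3) for dx in range(3)],
        axis=1,
    )

    # Weighted sum over the 9 neighbours: (2, factor, factor, H, W).
    fine = (neighbours[:, :, None, None] * weights[None]).sum(axis=1)

    # Interleave the sub-pixel axes with the spatial axes.
    fine = fine.transpose(0, 3, 1, 4, 2)
    return fine.reshape(2, height * factor, width * factor)


# Uniform weights on a constant field of 1 coarse pixel.
coarse = np.ones((2, 4, 4))
uniform = convex_upsample(coarse, np.zeros((9, 8, 8, 4, 4)))
```

Without the softmax, the weights would be unconstrained and the upsampled values could fall outside the range of the neighbouring coarse values. With it, each output is an interpolation of its neighbours.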