fh2019ustc / DocTr

The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.

Four questions about GeoTr.py #15

Closed (hanquansanren closed this issue 1 year ago)

hanquansanren commented 1 year ago

Hello Hao, I have read your pioneering work on using ViT for document unwarping. In the paper, the design of the geometry tail is marvellous: it proposes a learnable module that upsamples the decoded features $f_d$, as shown in the following figure: [image of the geometry tail]. I have four questions:

Q1: As I understand it, this design is essentially a local dot product of two feature maps ($f_o$ and $f_m$). Is that correct? I would not have come up with such a design myself, so I wonder what motivated this tail design. Is there a similar design in other papers?

Q2: In the following code, why does the flow need to be multiplied by 8? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L211

Q3: In the following code, why is softmax applied to the mask? Does the softmax operation have any special significance here? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L209

Q4: In the following code, why is coodslar added to the predicted backward mapping? Is this operation important? My guess is that it acts as a kind of position encoding. But this is the final operation in the network, so why not add this position encoding at an earlier layer? https://github.com/fh2019ustc/DocTr/blob/bbb1af9c01788bc28f5249ea14ea66d2b9f55353/GeoTr.py#L231

Many thanks for your explanation.

Best wishes, Weiguang Zhang

fh2019ustc commented 1 year ago

Hi, in fact, this is not our contribution. In our new ECCV 2022 paper, we state that this module is borrowed from RAFT, which was proposed for optical flow estimation. Document image rectification can in fact be regarded as a monocular flow regression problem. For your Q4, in our experience, it is easier to estimate pixel-wise displacements than absolute coordinates.
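For readers following Q2 to Q4, here is a minimal sketch of RAFT-style convex upsampling followed by the residual coordinate addition. It is an illustration under stated assumptions, not the repository's exact code: the function names `convex_upsample`, `base_grid`, and `backward_map` are hypothetical, and the upsampling factor of 8 and the 3x3 neighbourhood are assumed from the RAFT formulation. The softmax turns the 9 mask logits into positive weights that sum to 1, so each fine pixel is a convex combination of its coarse neighbours; the multiplication by 8 rescales displacements from coarse-grid pixels to full-resolution pixels; and adding the coordinate grid converts predicted displacements into absolute sampling coordinates.

```python
import torch
import torch.nn.functional as F


def convex_upsample(flow, mask, factor=8):
    """Upsample a coarse flow [N, 2, H, W] to [N, 2, factor*H, factor*W].

    mask holds raw logits of shape [N, 9 * factor * factor, H, W]: for each
    fine pixel inside a coarse cell, 9 weights over the 3x3 coarse neighbours.
    """
    n, _, h, w = flow.shape
    mask = mask.reshape(n, 1, 9, factor, factor, h, w)
    # Softmax over the 9 neighbours: weights are positive and sum to 1,
    # so every output value is a convex combination of coarse values.
    mask = torch.softmax(mask, dim=2)

    # Displacements are expressed in coarse-grid pixels; scale them by
    # `factor` so they are expressed in full-resolution pixels.
    up = F.unfold(factor * flow, kernel_size=3, padding=1)  # [N, 2*9, H*W]
    up = up.view(n, 2, 9, 1, 1, h, w)

    up = torch.sum(mask * up, dim=2)   # [N, 2, factor, factor, H, W]
    up = up.permute(0, 1, 4, 2, 5, 3)  # [N, 2, H, factor, W, factor]
    return up.reshape(n, 2, factor * h, factor * w)


def base_grid(n, h, w, device=None):
    """Identity coordinate grid [N, 2, H, W]; channel 0 is x, channel 1 is y."""
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device),
        torch.arange(w, device=device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=0).float()
    return grid[None].repeat(n, 1, 1, 1)


def backward_map(flow, mask, factor=8):
    """Absolute sampling coordinates = identity grid + upsampled displacement."""
    up = convex_upsample(flow, mask, factor)
    n, _, h, w = up.shape
    return base_grid(n, h, w, device=up.device) + up
```

With this formulation the network only has to regress a residual around the identity mapping: a zero displacement already yields an undistorted sampling grid, which is one way to read the remark that displacements are easier to estimate than absolute coordinates.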

hanquansanren commented 1 year ago

Thanks a lot for your response. I will check RAFT for more details. (づ ̄3 ̄)づ╭❤~