FelixHertlein / inv3d

Project page for the ICDAR 2023 Paper "Inv3D: a high-resolution 3D invoice dataset for template-guided single-image document unwarping".
https://felixhertlein.github.io/inv3d/

A question about bm data processing #3

Closed · shallweiwei closed 4 months ago

shallweiwei commented 4 months ago

I'm a little confused about this crop and bm/448. After the image cropping, I think bm should not be divided by 448. Could you please answer this question? Thanks!

FelixHertlein commented 4 months ago

While creating the Inv3D dataset, we render a larger area around the invoices to ensure the document is fully contained in the rendering. To make the document dewarping task on Inv3D comparable to Doc3D, we remove the surrounding area using the ground truth. In a production setting, one would need to detect the invoice first in order to remove the excess surroundings.

The division of the BM by 448 is only needed and applied for the Doc3D dataset, whose BM data format differs from ours. They define the map as a 448x448x2 matrix with values ranging from 0 to 448. We, on the other hand, define the backward map as a matrix with values ranging from 0 to 1. This is why we divide by 448: to bring all data into the same format.
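For illustration, a minimal sketch of this conversion (the doc3d_bm array here is a random stand-in, not actual Doc3D data):

import numpy as np

# A Doc3D-style backward map: 448x448x2, values in pixel units [0, 448].
doc3d_bm = np.random.uniform(0, 448, size=(448, 448, 2)).astype(np.float32)

# Dividing by 448 yields relative coordinates in [0, 1], the Inv3D format.
bm = doc3d_bm / 448.0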

shallweiwei commented 4 months ago

Sorry for the late reply. What I meant is that bm/448 only works because the image is not cropped. Once it is cropped, the positional relationship between pixels changes. For example, if the cropped image becomes 256x300, bm should be divided by 256 and 300, respectively.

FelixHertlein commented 4 months ago

The division by 448 only normalizes the backward mapping vectors in the BM from [0, 448] x [0, 448] to [0, 1] x [0, 1], meaning that (0, 0) is the top-left corner and (1, 1) the bottom-right corner. The value normalization is independent of the cropping.

When cropping an image, the BM has to be changed accordingly in order to keep the mapping correct. We crop the images here and the BM here.

The BM cropping looks like this:

import numpy as np

def tight_crop_map(input_map: np.ndarray):
    check_tensor(input_map, "h w 2")  # shape assertion from our code base: (h, w, 2)

    # Shift the per-channel minimum to 0 and scale the maximum to 1,
    # so the cropped BM spans the full [0, 1] x [0, 1] range again.
    input_map -= np.nanmin(input_map, axis=(0, 1), keepdims=True)
    input_map /= np.nanmax(input_map, axis=(0, 1), keepdims=True)
    return input_map

A BM which contains a margin of 10% around the foreground object looks like this: [0.1, 0.9]^{448 x 448}. After removing the margin, you need to normalize the BM vectors such that the BM looks like this: [0, 1]^{448 x 448}. The resolution 448 x 448 only denotes the resolution of the unwarped image, not where the pixels of the unwarped image originate from.
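As a quick illustrative check, using tight_crop_map from above (the toy BM below is constructed just for this example):

import numpy as np

# Toy BM whose vectors span [0.1, 0.9] on both axes, i.e. a 10% margin.
ys, xs = np.meshgrid(
    np.linspace(0.1, 0.9, 448),
    np.linspace(0.1, 0.9, 448),
    indexing="ij",
)
bm = np.stack([xs, ys], axis=-1)  # shape (448, 448, 2)

bm = tight_crop_map(bm)
print(bm.min(), bm.max())  # 0.0 1.0, the margin is gone, full [0, 1] range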

I hope this answers your question.

shallweiwei commented 4 months ago

Thank you very much for your patient reply. My question comes from the crop implementation of another very good work, UVDoc. There, grid2D is actually a sparser bm; through this processing, bm can unwarp the cropped picture via grid_sample.

FelixHertlein commented 4 months ago

The sparsity of the BM is not relevant for the BM cropping. The BM is essentially a matrix of vectors, each specifying from which position in the original image the color value should be picked. This works for any resolution h x w.
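To make that concrete, here is a toy sketch (not code from either repo; it assumes a relative BM with x in channel 0 and y in channel 1) that applies a BM of arbitrary resolution by nearest-neighbor lookup:

import numpy as np

def unwarp_nearest(image: np.ndarray, bm: np.ndarray) -> np.ndarray:
    # image: (H, W, 3); bm: (h, w, 2) with relative coordinates in [0, 1].
    H, W = image.shape[:2]
    xs = np.clip(np.rint(bm[..., 0] * (W - 1)).astype(int), 0, W - 1)
    ys = np.clip(np.rint(bm[..., 1] * (H - 1)).astype(int), 0, H - 1)
    # The output resolution equals the BM resolution h x w,
    # independent of the input image resolution.
    return image[ys, xs]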

The work of UVDoc uses the same method as we do for BM normalization:

grid2D[0, :, :] = (grid2D[0, :, :] - left) / (size[1] - left - right)  # x: shift by left crop, divide by cropped width
grid2D[1, :, :] = (grid2D[1, :, :] - top) / (size[0] - top - bot)      # y: shift by top crop, divide by cropped height

This is essentially the same as what we do:

    input_map -= np.nanmin(input_map, axis=(0, 1), keepdims=True)
    input_map /= np.nanmax(input_map, axis=(0, 1), keepdims=True)

The only difference is that our coordinates are relative (i.e. between 0 and 1), whereas the coordinates in UVDoc are in pixels.

When applying the BM to an image (see here), we transform the coordinates to [-1, 1], as grid_sample(..) expects the BM vectors to be in this range:

bm = (bm * 2) - 1  # map relative coordinates [0, 1] -> [-1, 1]

From the documentation here:

grid specifies the sampling pixel locations normalized by the input spatial dimensions. Therefore, it should have most values in the range of [-1, 1]. For example, values x = -1, y = -1 is the left-top pixel of input, and values x = 1, y = 1 is the right-bottom pixel of input.
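Putting the pieces together, a minimal sketch of the unwarping step (names are illustrative; align_corners=True is an assumption, check the respective code base):

import torch
import torch.nn.functional as F

def apply_bm(image: torch.Tensor, bm: torch.Tensor) -> torch.Tensor:
    # image: (N, C, H, W); bm: (N, H_out, W_out, 2), relative coords in [0, 1].
    grid = bm * 2 - 1  # grid_sample expects sampling locations in [-1, 1]
    return F.grid_sample(image, grid, align_corners=True)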

UVDoc also needs to transform their BM from absolute to relative coordinates between -1 and 1 before applying grid_sample. I am not sure where they do that in their code base.

shallweiwei commented 4 months ago

Thank you for your patience. I thought about it again, and I think the reason I had this problem is that I did not take into account that UVDoc performs data augmentation through different crops, so the crop offsets on each side have to be considered in the subtraction and division operations. Your implementation finds the edges directly from the UV map for cropping, so you only need to consider the minimum and maximum of the bm. I think my question has been solved; thank you again for your answer.