NormXU / ERNIE-Layout-Pytorch

An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.
http://arxiv.org/abs/2210.06155
MIT License

Should we use 2 points or point + width/height in bbox for model training? #15

Closed calvinzhan closed 1 year ago

calvinzhan commented 1 year ago

In Paddle, bbox is [top_left_point_x, top_left_point_y, box_width, box_height], while in LayoutLMv3_DocVQA, bbox is [top_left_point_x, top_left_point_y, right_bottom_x, right_bottom_y]. Which one should we use?

Since the model is converted from Paddle, should it be trained with the first bbox format? And should we follow that format during fine-tuning?

NormXU commented 1 year ago

I checked the Paddle implementation here:

def _cal_spatial_position_embeddings(self, bbox):
        try:
            left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
            upper_position_embeddings = self.y_position_embeddings(bbox[:, :, 1])
            right_position_embeddings = self.x_position_embeddings(bbox[:, :, 2])
            lower_position_embeddings = self.y_position_embeddings(bbox[:, :, 3])
        except IndexError as e:
            raise IndexError("The :obj:`bbox`coordinate values should be within 0-1000 range.") from e

        h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
        w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])
        return (
            left_position_embeddings,
            upper_position_embeddings,
            right_position_embeddings,
            lower_position_embeddings,
            h_position_embeddings,
            w_position_embeddings,
        )

As you can see, bbox[:, :, 3] - bbox[:, :, 1] calculates the height and bbox[:, :, 2] - bbox[:, :, 0] calculates the width. The code derives box_width and box_height itself from the two corner points, so we should use the same order as LayoutLMv3: [top_left_point_x, top_left_point_y, right_bottom_x, right_bottom_y].
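If your annotations are stored as point + width/height, a small conversion step before feeding the model should be enough. The following is only a minimal sketch: the helper names (`xywh_to_xyxy`, `normalize_box`, `derived_size`) and the page size are made up for illustration and are not part of this repo or PaddleNLP. The 0-1000 scaling is assumed from the error message in the snippet above.

```python
from typing import List, Sequence, Tuple


def xywh_to_xyxy(box: Sequence[int]) -> List[int]:
    """Convert [x, y, width, height] to [x0, y0, x1, y1] (hypothetical helper)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def normalize_box(box: Sequence[int], page_width: int, page_height: int,
                  scale: int = 1000) -> List[int]:
    """Scale pixel corner coordinates into the 0..scale range (assumed 0-1000)."""
    x0, y0, x1, y1 = box
    return [
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    ]


def derived_size(box: Sequence[int]) -> Tuple[int, int]:
    """Repeat the subtraction done in the model code: returns (height, width)."""
    return box[3] - box[1], box[2] - box[0]


# Example: a 50x20 pixel box at (100, 200) on an assumed 1000x2000 pixel page.
xywh = [100, 200, 50, 20]
xyxy = normalize_box(xywh_to_xyxy(xywh), page_width=1000, page_height=2000)

print(xyxy)                # [100, 100, 150, 110]
print(derived_size(xyxy))  # (10, 50)

# Passing the width/height format straight through gives negative "sizes",
# which is why the corner-point order is the one to use.
print(derived_size(xywh))  # (-180, -50)
```

With the corner-point format, the subtraction in the model code yields a non-negative height and width. With the width/height format, it subtracts a coordinate from a size, and the result is meaningless.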