calvinzhan closed this issue 1 year ago
I checked the Paddle implementation here:
    def _cal_spatial_position_embeddings(self, bbox):
        try:
            left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
            upper_position_embeddings = self.y_position_embeddings(bbox[:, :, 1])
            right_position_embeddings = self.x_position_embeddings(bbox[:, :, 2])
            lower_position_embeddings = self.y_position_embeddings(bbox[:, :, 3])
        except IndexError as e:
            raise IndexError("The :obj:`bbox`coordinate values should be within 0-1000 range.") from e
        h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
        w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])
        return (
            left_position_embeddings,
            upper_position_embeddings,
            right_position_embeddings,
            lower_position_embeddings,
            h_position_embeddings,
            w_position_embeddings,
        )
As you can see, bbox[:, :, 3] - bbox[:, :, 1] calculates the height and bbox[:, :, 2] - bbox[:, :, 0] calculates the width. The code computes the box width and height itself, so the input must already hold the two corner points, and we should use the same order as LayoutLMv3.
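To make the point concrete, here is a minimal sketch of the same subtraction in plain Python. The helper name box_size and the sample coordinates are made up for illustration and are not part of the Paddle code.

```python
def box_size(bbox):
    """Return (width, height), treating bbox as [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    # Same arithmetic as bbox[:, :, 2] - bbox[:, :, 0] and bbox[:, :, 3] - bbox[:, :, 1]
    return x1 - x0, y1 - y0


# A box given by its two corners: the subtraction yields a sensible size.
corner_box = [100, 200, 300, 260]
print(box_size(corner_box))  # (200, 60)

# The same box given as [x, y, width, height]: the subtraction is meaningless
# and the "height" even comes out negative.
xywh_box = [100, 200, 200, 60]
print(box_size(xywh_box))  # (100, -140)
```

With the corner format the result matches the real width and height. With the width/height format the values passed on to the h/w embeddings would not be valid sizes.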
In Paddle, bbox is [top_left_point_x, top_left_point_y, box_width, box_height], while in LayoutLMv3_DocVQA, bbox is [top_left_point_x, top_left_point_y, right_bottom_x, right_bottom_y]. Which one should we use?
Since the model was converted from Paddle, was it trained with the first bbox format? If so, should we follow that format during fine-tuning?
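If the corner format turns out to be the right one, boxes stored as [x, y, width, height] would need converting before being fed to the model. A minimal sketch of that conversion follows. The helper name xywh_to_xyxy is hypothetical, and it assumes the coordinates are already scaled to the 0-1000 range mentioned in the error message above.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x0, y0, x1, y1].

    Assumes the inputs are already normalized to the 0-1000 range.
    """
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


print(xywh_to_xyxy([100, 200, 200, 60]))  # [100, 200, 300, 260]
```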