X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Apache License 2.0

The images being sliced vertically. How can I make it horizontal. #56

Closed rjmehta1993 closed 2 months ago

rjmehta1993 commented 2 months ago

Hi, thanks for the amazing model.

I added the intermediate images of the grid anchor. I see the images are sliced vertically. How can I make it horizontal?

Original Image:

Screenshot 2024-04-19 at 11 26 24 AM

Intermediate anchor sliced image ("grid_9"):

0_intermediate

HAWLYQ commented 2 months ago

Hi @rjmehta1993, sorry, I don't quite understand what you mean by "the images are sliced vertically". As your sliced image shows, the raw image is cropped into a 2x3 grid and the crops are then concatenated horizontally.
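For readers following along, the 2x3 cropping described above can be illustrated with a minimal sketch. This is not the mPLUG-DocOwl implementation: `grid_boxes` is a hypothetical helper, the 1200x800 image size is made up, and reading "2x3" as 2 rows by 3 columns is an assumption.

```python
def grid_boxes(width, height, rows, cols):
    """Return crop boxes as (left, upper, right, lower), in row-major order.

    Hypothetical helper for illustration only; the real preprocessing in
    mPLUG-DocOwl may choose and resize crops differently.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


# Assumed image size and grid orientation (2 rows, 3 columns).
boxes = grid_boxes(1200, 800, rows=2, cols=3)
for box in boxes:
    print(box)
```

Each printed box is one crop. Laying the six crops side by side in this order gives a long horizontal strip like the intermediate image in the question.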

rjmehta1993 commented 2 months ago

Hey, thanks for the quick reply. If you look at the slicing, the cuts break up the columns. If I avoid slicing by using grid_1, the resolution is very low. The slicing separates the columns from the index of the table, so a simple question such as "What is the management fee for column ITD?" fails.

Can I slice it horizontally instead? Something like the image below.

0_intermediate
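To make the concern concrete, here is a small sketch that counts how many crops a full-width table row overlaps under a 2x3 grid versus full-width horizontal strips. Everything in it is hypothetical (the helper names, the 1200x800 image size, the row coordinates, and the choice of 6 strips), and it says nothing about what the model itself supports.

```python
def grid_boxes(width, height, rows, cols):
    """Crop boxes as (left, upper, right, lower), in row-major order."""
    return [
        (c * width // cols, r * height // rows,
         (c + 1) * width // cols, (r + 1) * height // rows)
        for r in range(rows)
        for c in range(cols)
    ]


def crops_touched(region, boxes):
    """Count the crop boxes that overlap a region (left, upper, right, lower)."""
    left, upper, right, lower = region
    return sum(
        1
        for (b_left, b_upper, b_right, b_lower) in boxes
        if left < b_right and b_left < right and upper < b_lower and b_upper < lower
    )


width, height = 1200, 800          # assumed image size
table_row = (0, 40, 1200, 80)      # assumed row: index on the left, values on the right

grid_2x3 = grid_boxes(width, height, rows=2, cols=3)
strips = grid_boxes(width, height, rows=6, cols=1)

split_by_grid = crops_touched(table_row, grid_2x3)
split_by_strips = crops_touched(table_row, strips)
print(split_by_grid, split_by_strips)
```

With these assumed coordinates the row is spread over three crops in the 2x3 grid but stays inside a single strip, which is the difference the question is about.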

HAWLYQ commented 2 months ago

Hi @rjmehta1993, sorry, we currently don't support cropping a global image into multiple rectangular crops like this.

However, I think the model fails to answer the question not because of the cropping but because of the non-standard row-column relationships in this image.

In fact, I tried parsing the text in the image with our HuggingFace demo, and the model organized the text according to the structure in the image, as shown below. This confirms that the 2x3 cropping doesn't prevent the model from understanding text in the same row.

image

rjmehta1993 commented 2 months ago

I agree. The problem is not parsing or recognizing the text; it is the spatial relationship between the column and the index that the question refers to.

It is not able to map the column and index to the question if the column is far from the index.

Not sure why. But thanks for this amazing model.

HAWLYQ commented 2 months ago

Thanks for your appreciation. Such non-standard column-row relationships are still a challenging problem. We will further improve the model's structure understanding in future work.