X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

DocOwl1.5: Inference results often in wrong order #52

Closed 7998857 closed 2 months ago

7998857 commented 2 months ago

Hello, I pulled your repo and so far the inference with the stage 1 model works fine. However, the results I get for localized text recognition are often in the wrong order. For example, I use this code (basically the demo code from the README.md):

from docowl_infer import DocOwlInfer
model_path = "./models/models--mPLUG--DocOwl1.5-stage1/.../"
docowl = DocOwlInfer(ckpt_path=model_path, anchors="grid_9", add_global_img=False)

image = "image.jpg"
query = "Identify the text within the bounding box <bbox>92, 444, 880, 480</bbox>"
answer = docowl.inference(image, query)

print(answer)

on this image (only the relevant part is left visible)

[attached image: 52_82_combined_0dfL_0_mittlere_seite_original]

Which gives the result 8 Spl. Fz.z.Pers.bef.b. 5

Here, the two parts "8 Spl." and "Fz.z.Pers.bef.b." are in the wrong order (the "5" at the end is hallucinated, but that only happens in the anonymized image, not in the original one, so it's no concern here). Something like that happens quite often. I have the feeling that I missed something. Am I using the model correctly?

There is indeed a warning that the code throws during inference:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

And also one during model loading:

Some weights of MPLUGDocOwlLlamaForCausalLM were not initialized from the model checkpoint at ... and are newly initialized: ['model.layers.4.self_attn.rotary_emb.inv_freq', ..., 'model.layers.2.self_attn.rotary_emb.inv_freq']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
HAWLYQ commented 2 months ago

Hi @7998857, thanks for pointing out this error. I have observed the same output with this image and instruction, so I think you have used the model correctly; the two warnings don't influence the model inference.

Honestly, I haven't come up with a reason for the wrong order yet. I will record this error, and if I find the cause I'll let you know as soon as possible.

Besides, we plan to release the training code at the end of this month. If you want better performance in your domain, you can try finetuning our model with your data~

7998857 commented 2 months ago

Hey, thanks for the fast reply! Okay, I'll wait for the training code and try to adapt the model to our domain. Looking forward to it.

7998857 commented 2 months ago

Hello, maybe I have found a hint as to why the model sometimes gets the order of the texts wrong. I have worked a bit with your multi-grained text localization data from the DocStruct4M dataset and found a significant number of label errors. So far these seem to occur mostly in the TextVQA and ChartQA subsets. An example would be this one:

[attached image: image_221799]

(I hope it is visible well enough: the text label is "sensitivity* touchscreen", where the order is wrong)

Here is the code to review the labels:

import pandas as pd

# each sample stores a list of image paths; keep only the first one
label_data = pd.read_json(path_or_buf="multi_grained_text_localization.jsonl", lines=True)
label_data.image = label_data.image.apply(lambda image_paths: image_paths[0])

# pick the samples belonging to the example image
subset = label_data[label_data.image.apply(lambda x: "c09bb959a7777b5.jpg" in x)]

# print every message that mentions the affected text
print([msg["content"] for msgs in subset.messages for msg in msgs if "sensitivity" in msg["content"]])

Which gives:

['<|image|>Give the bounding box of the text <ocr> sensitivity* touchscreen </ocr>', '<ocr> sensitivity* touchscreen </ocr>', '<|image|>Predict the bounding box of the text <ocr> sensitivity* touchscreen \n asivds bns aisup </ocr>']

This kind of error occurs rather often in the TextVQA subset. Overall, the quality of the labels seems to be rather poor there. The ChartQA subset shows similar errors. As far as I can see, both sets together make up about 30% of the localization dataset, which could explain the model's problem with word order.
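
The 30% figure is a rough estimate based on the image paths. Continuing from the snippet above, something like this gives the share (assuming the subset name appears in the image path, which may not hold exactly for your copy of the data):

# rough estimate of how much TextVQA and ChartQA contribute to the localization set;
# the keywords may need adjusting to the actual directory layout
is_critical = label_data.image.str.contains("TextVQA|ChartQA", case=False)
print(f"{is_critical.mean():.1%} of {len(label_data)} samples")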

HAWLYQ commented 2 months ago

Hi @7998857, thanks for your hard work! I agree with your guess.

Order errors arise easily in TextVQA because the texts are organized according to their bbox position: the bbox with the smaller y (distance to the upper boundary) and then the smaller x (distance to the left boundary) is placed first. In this case, the y of "sensitivity" is smaller than that of "touchscreen", so it is placed before it. Such an order is not always consistent with the reading order. It's difficult to resolve this issue because the text orientation in natural images is quite diverse.
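
To illustrate, here is a minimal sketch of that ordering rule (hypothetical coordinates, not the actual dataset-construction code):

words = [
    # (text, x, y): top-left corners of the word boxes; hypothetical values.
    # Because the text in the image is rotated, "sensitivity*" ends up with a
    # slightly smaller y than "touchscreen".
    ("touchscreen", 120, 300),
    ("sensitivity*", 260, 290),
]
ordered = sorted(words, key=lambda w: (w[2], w[1]))  # sort by y first, then x
print(" ".join(w[0] for w in ordered))  # "sensitivity* touchscreen" -- not the reading order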

We will try to fix such errors in future work. If you have any good ideas and are willing to share them with us, we would appreciate it very much!

7998857 commented 2 months ago

Hey, thanks for the information. That makes sense now. We had similar problems in projects at work, too. If I come up with a solution that works here, I will post it. For now, I will just exclude the critical subsets in my project.
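
For now I exclude them roughly like this (again assuming the subset name appears in the image path; the keywords may need adjusting to the actual directory layout):

import pandas as pd

# drop every sample whose image path points into TextVQA or ChartQA and write
# the remaining samples back out as jsonl
label_data = pd.read_json("multi_grained_text_localization.jsonl", lines=True)
keep = ~label_data.image.apply(lambda paths: any(("TextVQA" in p) or ("ChartQA" in p) for p in paths))
label_data[keep].to_json("multi_grained_text_localization_filtered.jsonl",
                         orient="records", lines=True, force_ascii=False)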

HAWLYQ commented 1 month ago

Hi @7998857, we have released the training code for finetuning DocOwl1.5 at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5. It is currently only supported with DeepSpeed ZeRO-2. We ran into deadlock issues with ZeRO-3; if you have any suggestions to share with us, we would appreciate it very much~
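
For reference, a generic ZeRO stage-2 DeepSpeed config looks roughly like the sketch below (written here as a Python dict dumped to JSON, with "auto" values as used by the HuggingFace Trainer integration). This is only an illustrative starting point, not necessarily the config shipped with the repo's training scripts:

import json

# generic ZeRO-2 settings; check the repo's scripts for the exact values they expect
zero2_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)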