7998857 closed this issue 2 months ago
Hi @7998857, thanks for pointing out this error. I have observed the same outputs with this image and instruction, so I think you are using the model correctly; the two warnings don't influence model inference.
Honestly, I haven't found a reason for the wrong order yet. I will record this error, and if I find the cause I'll let you know as soon as possible.
Besides, we plan to release the training code at the end of this month. If you want better performance in your domain, you can try finetuning our model with your data~
Hey, thanks for the fast reply! Okay, I'll wait for the training code and try to adapt the model to our domain. Looking forward to it.
Hello, maybe I have found a hint as to why the model sometimes gets the order of the texts wrong. I have worked a bit with your multi-grained text localization data from the DocStruct4M dataset and found a significant number of label errors. So far these seem to occur mostly in the TextVQA and ChartQA subsets. An example would be this one
(I hope one can see it well enough: the text label is "sensitivity* touchscreen" where the order is wrong)
Here is the code to review the labels:
import pandas as pd

# Load the localization annotations; each sample carries a list of image paths,
# of which we keep only the first.
label_data = pd.read_json(path_or_buf="multi_grained_text_localization.jsonl", lines=True)
label_data.image = label_data.image.apply(lambda image_paths: image_paths[0])

# Select all samples referencing this particular image...
subset = label_data[label_data.image.apply(lambda x: "c09bb959a7777b5.jpg" in x)]

# ...and print every message mentioning "sensitivity".
print([msg["content"] for msgs in subset.messages for msg in msgs if "sensitivity" in msg["content"]])
Which gives:
['<|image|>Give the bounding box of the text <ocr> sensitivity* touchscreen </ocr>', '<ocr> sensitivity* touchscreen </ocr>', '<|image|>Predict the bounding box of the text <ocr> sensitivity* touchscreen \n asivds bns aisup </ocr>']
This kind of error occurs rather often in the TextVQA subset; overall, label quality there seems rather poor. ChartQA shows similar errors. As far as I can see, the two sets together make up about 30% of the localization dataset, which could explain the model's problems with word order.
Hi, @7998857 Thanks for your hard work! I agree with your guess.
Order errors arise easily in TextVQA because the texts are organized by their bbox position: the bbox with the smaller y (distance to the upper boundary) and then the smaller x (distance to the left boundary) is placed first. In this case, the y of "sensitivity" is smaller than that of "touchscreen", so it is placed before it. Such an order is not always consistent with the reading order, and this is difficult to resolve because text orientation in natural images is quite diverse.
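For illustration, the (y, x) ordering described above can be sketched like this. The coordinates are invented for this example to reproduce the "sensitivity* touchscreen" case: a tilted layout puts "sensitivity*" at a smaller y even though "touchscreen" comes first in reading order.

```python
# Sort words by top-left corner: smaller y first, then smaller x.
# This mimics the bbox-based ordering described above; the boxes
# below are made up for illustration.
words = [
    ("touchscreen", {"x": 120, "y": 310}),
    ("sensitivity*", {"x": 135, "y": 295}),
]

ordered = sorted(words, key=lambda w: (w[1]["y"], w[1]["x"]))
print(" ".join(w[0] for w in ordered))  # sensitivity* touchscreen
```

Because "sensitivity*" has the smaller y, the sort puts it first even though the reading order is "touchscreen sensitivity*".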
We will try to fix such errors in future work. If you have any good ideas and are willing to share them, we would appreciate it very much!
Hey, thanks for the information, that makes sense now. We have had similar problems in projects at work, too. If I come up with a solution that works here, I will post it. For now, I will just exclude the critical subsets in my project.
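A minimal sketch of that exclusion step, assuming the subset name (e.g. "TextVQA" or "ChartQA") appears somewhere in each sample's image path; verify this against your local copy of DocStruct4M before relying on it:

```python
import pandas as pd

# Assumption: the source subset name is part of each sample's image path;
# check the actual directory layout of DocStruct4M first.
EXCLUDED = ("TextVQA", "ChartQA")

def drop_subsets(label_data: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose image path points into an excluded subset."""
    keep = label_data.image.apply(lambda p: not any(s in p for s in EXCLUDED))
    return label_data[keep]

# Usage with the localization file from above:
# label_data = pd.read_json("multi_grained_text_localization.jsonl", lines=True)
# label_data.image = label_data.image.apply(lambda paths: paths[0])
# drop_subsets(label_data).to_json("localization_filtered.jsonl",
#                                  orient="records", lines=True)
```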
Hi @7998857, we have released the training code for finetuning DocOwl1.5 at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5. For now only DeepSpeed ZeRO-2 is supported; we run into deadlock issues with ZeRO-3, so if you have any suggestions to share with us, we would appreciate it very much~
Hello, I pulled your repo, and so far inference with the stage 1 model works fine. However, the results I get for localized text recognition are often in the wrong order. For example, I use this code (basically the demo code from the README.md):
on this image (only the relevant part is left visible)
Which gives the result 8 Spl. Fz.z.Pers.bef.b. 5
Here, the two parts "8 Spl." and "Fz.z.Pers.bef.b." are in the wrong order (the "5" at the end is hallucinated, but that only happens in the anonymized image, not in the original one, so it is no concern here). Something like that happens quite often. I have the feeling I missed something. Am I using the model correctly?
There is indeed a warning the code throws during inference:
And also one during model loading: