facebookresearch / MultiplexedOCR

Code for CVPR21 paper A Multiplexed Network for End-to-End, Multilingual OCR

Why is performance so bad? [and relative to Apple iOS] #10

Closed davidbernat closed 1 year ago

davidbernat commented 1 year ago

You have no idea how much I appreciate the contributions FAIR continues to make to the next generation of Open Source (OS). I am running MultiplexedOCR out-of-the-box on a high-resolution photograph of a high-dpi published book in an easily identifiable font. Why is the performance so bad? Why is it so much worse than Apple's iOS OCR, which gets this correct instantly?

Furthermore: I noticed that MultiplexedOCR performed very well on non-book published text. Why the divergence? That cannot possibly be built into the model, can it? And it would seem unlikely to be a choice reflected in the training data.

This application I am working on is very important and could serve the FAIR and Facebook community tremendously. We are days away from execution, and this step seems to be the only one holding us back. I do hope you will give us your attention on this.

After all: as they say, 'attention is all you need.' 😉

Also: why does running on the image not return the text, only the text segmentation?

davidbernat commented 1 year ago

Several updates:

  1. I still am unable to have the model return any words, only segmentation.
  2. I found that cropping the book so that the words were relatively larger worked when run from the command line. But running from within an IDE with an identical setup produced stochastic results.
  3. The config that comes with the weights in the README.md did not produce results as good as the demo.yaml config.
  4. I made several small bug fixes, such as a missing logger. I will post those as more progress is made.

If you can remedy these, please do. Thanks.
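For context on item 2, the cropping workaround amounts to taking a sub-region of the photo so the glyphs occupy a larger fraction of the frame before inference. A minimal, illustrative sketch of that idea on a plain nested-list pixel grid (in practice you would crop with PIL or OpenCV before calling the model; nothing here is from the repository):

```python
# Illustrative only: cropping so the words are "relatively larger" just
# means selecting a sub-region of the pixel grid before inference.
def crop(pixels, left, top, right, bottom):
    """Return the sub-grid covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in pixels[top:bottom]]

# Example: an 8x10 grid cropped to a 4-row by 5-column region.
grid = [[(r, c) for c in range(10)] for r in range(8)]
region = crop(grid, left=2, top=1, right=7, bottom=5)
```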

SuperIRabbit commented 1 year ago

Hi @davidbernat,

Thank you for your interest in our work!

It's a known problem that scene-text images are very different from document text, and models don't transfer well between the two; unfortunately, the public datasets we use are mostly scene text. If your application needs to deal with documents, it would be better to train a separate model on document datasets.

For the issue of not returning any words, it's usually a config that doesn't match the model itself, which breaks the character mapping. Could you double-check whether the config in README.md works with the weights?

davidbernat commented 1 year ago

That surprises me. The OCR worked impressively well on scene text of varying size and noisiness. A book photo, when tightly cropped and contiguous, worked well at times and not at all at others; the irregularity across repeated runs was surprising. The minimal transfer also surprises me, as block text with handwritten notes seems to remove all functionality at times. These do not seem to be behaviors implicit in the architecture, so seeing them manifest in practice is much of a surprise to me. You all are much smarter than I am and are doing something I could never do. Still, the gap between my intuition and the observed behavior is surprising. I promise I will double-check my execution of the code to see whether any other confounding variables could be present.

Regarding the configuration files: they are the same as in the GitHub repository. Can you double-check that those in the repository work as you expect, and post several examples similar to what I am describing: photos of books taken with phones, with occasional handwritten annotations?


davidbernat commented 1 year ago

Please let me know when you have examples of this same behavior. Thank you.


SuperIRabbit commented 1 year ago

@davidbernat Have you modified the yaml file so that CHAR_MAP.DIR is pointing to the directory containing the character map jsons? I just uploaded an example notebook for your reference: https://github.com/facebookresearch/MultiplexedOCR/blob/main/notebook/inference/demo.ipynb
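For readers following along, "pointing CHAR_MAP.DIR to the directory containing the character map jsons" is a small yaml edit along these lines. The surrounding structure and path below are assumptions; check the actual demo.yaml and the README for the exact keys:

```yaml
# Hypothetical fragment -- only the CHAR_MAP.DIR key is taken from this
# thread; the path and surrounding structure are placeholders.
CHAR_MAP:
  DIR: "/path/to/charmap_dir"  # must contain the character-map *.json files
```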

davidbernat commented 1 year ago

Please do not close this ticket. None of the code from your repository was modified in my runs, as I stated in my email. Thanks. :-)

SuperIRabbit commented 1 year ago

> Please do not close this ticket. None of the code from your repository was modified in my runs, as I stated in my email. Thanks. :-)

You should at least modify the yaml file so that CHAR_MAP.DIR is pointing to the directory containing the character map jsons (see the readme file in the repo), otherwise it won't work. Let me know if you are able to reproduce the example notebook above :-)
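A quick, hypothetical sanity check (stdlib only, not part of the repo) can confirm before inference that the configured directory actually exists and contains character-map json files:

```python
import os

def list_char_maps(char_map_dir):
    """Return sorted character-map json filenames found in char_map_dir,
    failing early with a clear message if the directory is wrong."""
    if not os.path.isdir(char_map_dir):
        raise FileNotFoundError(f"CHAR_MAP.DIR does not exist: {char_map_dir}")
    jsons = sorted(f for f in os.listdir(char_map_dir) if f.endswith(".json"))
    if not jsons:
        raise ValueError(f"no character-map jsons in: {char_map_dir}")
    return jsons
```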