doc-analysis / DocBank

DocBank: A Benchmark Dataset for Document Layout Analysis
Apache License 2.0

some labels are missing #52

Open alireza-hariri opened 2 weeks ago

alireza-hariri commented 2 weeks ago

image

I just noticed that some words in the cover image are missing.

I couldn't find the code that generates this dataset from the original documents, so I can't suggest an edit.

Note: The second error in the image is the word "second", which was split at a dash. This error makes sense, but I couldn't find a reason for the first one.

alireza-hariri commented 2 weeks ago

After more inspection, I found some other problems, this time with box sizes:

  1. Many boxes have zero width or height, even when the label is "paragraph" and the token doesn't include "Line##".
  2. Many boxes with the "paragraph" label are too tall (see the image).

image
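
For anyone who wants to reproduce the two box-size problems above, here is a minimal sketch of how they could be flagged. It is not code from this repository. It assumes each annotation row is tab-separated as token, x0, y0, x1, y1, R, G, B, font, label, and the names `parse_row`, `find_suspicious_boxes` and the `tall_factor` threshold are hypothetical, chosen only for illustration. "Too tall" is defined here as more than `tall_factor` times the median height of the paragraph boxes on the same page, which is just one possible heuristic.

```python
from statistics import median


def parse_row(line):
    """Parse one annotation row.

    Assumed layout (tab-separated): token, x0, y0, x1, y1, R, G, B, font, label.
    """
    parts = line.rstrip("\n").split("\t")
    token = parts[0]
    x0, y0, x1, y1 = (int(v) for v in parts[1:5])
    label = parts[-1]
    return token, (x0, y0, x1, y1), label


def find_suspicious_boxes(lines, tall_factor=3.0):
    """Return (degenerate, too_tall) lists of (token, box) for paragraph tokens.

    degenerate: boxes with zero (or negative) width or height.
    too_tall:   boxes taller than tall_factor times the median paragraph height.
    Tokens containing "Line##" are skipped, since those mark drawn lines.
    """
    rows = [parse_row(line) for line in lines if line.strip()]
    paragraphs = [
        (token, box)
        for token, box, label in rows
        if label == "paragraph" and "Line##" not in token
    ]

    # Median height of the non-degenerate paragraph boxes on this page.
    heights = [y1 - y0 for _, (_, y0, _, y1) in paragraphs if y1 > y0]
    typical = median(heights) if heights else None

    degenerate, too_tall = [], []
    for token, box in paragraphs:
        x0, y0, x1, y1 = box
        width, height = x1 - x0, y1 - y0
        if width <= 0 or height <= 0:
            degenerate.append((token, box))
        elif typical is not None and height > tall_factor * typical:
            too_tall.append((token, box))
    return degenerate, too_tall
```

To run it on one page, pass the lines of that page's annotation file, for example `find_suspicious_boxes(open(path, encoding="utf-8"))`, where `path` points to a single annotation text file.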