
TableBank: A Benchmark Dataset for Table Detection and Recognition
Apache License 2.0

Question on the quality of table annotations #19

Open julianyulu opened 4 years ago

julianyulu commented 4 years ago

Hi,

The Problem

I'd like to thank you for releasing this nice dataset. However, I found that the annotation quality is not very high, due mainly to two issues:

  1. Missing labels: no annotation found for an existing table
  2. Inaccurate annotations: some bboxes do not cover the whole table region

Issue 1 was already mentioned in #9 , where the author answered:

some error may cause a little table unlabeled

However, I plotted the first 100 image ids and their annotations in /Detection_data/Word, and found 21 images out of 100 with missing annotations (1 to 3 tables missing per image). Unless I was extremely lucky to catch these problematic annotations in the first 100 plots, this issue affects far more than 'a little table'.

To be specific, I post the imgIds for those 21 images:

3, 9, 10, 27, 32, 33, 39, 47, 51, 56, 57, 58, 59, 60, 61, 62, 73, 76, 77, 87, 95

As for issue 2, I found 3 images (out of 100 tested images) with incorrect annotations:

18, 62, 83

I understand from the paper that these annotations are generated by parsing the PDF/Word documents, and that the parsing code cannot catch all the tables. I post this here only to give researchers some information they might care about.

Possible Fix

Issue 1 is actually not hard to fix. I have trained a table detection model (on other datasets) with decent performance, and I'd like to run it once over all the data provided here, hopefully spot a large number of the missing annotations, and then fix those manually. I'd be happy to share and discuss more.
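The cross-check described above could be sketched as follows (a minimal sketch with hypothetical helper names, assuming boxes in the COCO [x, y, w, h] format): any detector box with no sufficiently overlapping ground-truth box is flagged as a possible missing annotation.

```python
def iou(box_a, box_b):
    # Intersection-over-union for two boxes in COCO [x, y, w, h] format.
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def flag_possible_missing(detected, annotated, thresh=0.5):
    """Return detector boxes that match no annotated box with IoU >= thresh.

    These are candidates for missing annotations; they would still need
    manual review, since they also include the detector's false positives.
    """
    return [d for d in detected
            if all(iou(d, a) < thresh for a in annotated)]
```

For example, with one annotated table and two detections, the unmatched second detection would be flagged for manual review.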

FYI

I loaded the data with pycocotools and got the annotations for each image using:

img_ann = coco.loadAnns(coco.getAnnIds(imgIds = image_id))

and plotted them on a matplotlib figure using

coco.showAnns(img_ann)

The missing/incorrect annotations were then spotted by eye.
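For anyone reproducing this check, images with no annotations at all can be pre-filtered directly from the raw COCO JSON before any eyeballing. A sketch, assuming the standard COCO layout ('images' and 'annotations' keys); note it only catches fully unlabeled images, not images where some of several tables are missing:

```python
import json
from collections import Counter

def images_without_annotations(coco_json_path):
    """List image ids that have no annotation entries at all.

    Assumes the standard COCO layout ('images' / 'annotations' keys).
    Partially annotated images still require the detector cross-check
    or manual inspection.
    """
    with open(coco_json_path) as f:
        data = json.load(f)
    counts = Counter(ann["image_id"] for ann in data["annotations"])
    return [img["id"] for img in data["images"] if counts[img["id"]] == 0]
```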

I'd be happy to discuss more and provide the testing ipynb if anyone is interested.

Best, Julian

charmichokshi commented 2 years ago

Hi @julianyulu

Did you manually check all the samples after running a table detection model to flag the 'possible wrong annotations'? Or did you use the loss or some other metric to detect them automatically?