PanDAWMS / dkb

Data Knowledge Base for HENP experiments

Table descriptions above tables #3

Open Evildoor opened 7 years ago

Evildoor commented 7 years ago

PDF Analyzer's table processing algorithm includes detecting the table description and separating table lines from all other lines. These procedures work on the assumption that the table description is positioned below the table (image: proper_table). However, some documents position descriptions above tables, or even mix both kinds of positioning. PDF Analyzer either fails to extract such tables or extracts them incorrectly.

Document examples: CDS_CERN-ATL-COM-PHYS-2016-135, page 13.
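
To illustrate the problem, here is a minimal sketch of a position-agnostic check, not PDF Analyzer's actual code. The `Line` class, the caption regex, and the function names are all hypothetical. The sketch assumes each text line carries `top`/`bottom` coordinates measured from the top of the page and that captions start with "Table N:" or "Table N.".

```python
import re
from dataclasses import dataclass

# Hypothetical caption pattern: "Table 3:" or "Table 3." at the start of a line.
CAPTION_RE = re.compile(r"^\s*Table\s+\d+\s*[:.]", re.IGNORECASE)


@dataclass
class Line:
    """A text line with vertical extent (assumed: measured from page top)."""
    text: str
    top: float
    bottom: float


def is_caption(line):
    """Return True if the line looks like a table caption."""
    return bool(CAPTION_RE.match(line.text))


def caption_position(caption, table_lines):
    """Classify a caption as 'above' or 'below' the table, else 'unknown'."""
    if not table_lines:
        return "unknown"
    table_top = min(line.top for line in table_lines)
    table_bottom = max(line.bottom for line in table_lines)
    if caption.bottom <= table_top:
        return "above"
    if caption.top >= table_bottom:
        return "below"
    # The caption overlaps the table's vertical extent.
    return "unknown"
```

With a check like this, the caption would be located first and the table lines then searched for on whichever side it indicates, instead of always looking above the caption.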

Evildoor commented 6 years ago

Note: it seems that the term "caption" is often used rather than "description" or "header".

Evildoor commented 6 years ago

Some work was done on this (see 5e149e75d48280754bf8e31f916f8970c1d17415). As usual, there is much to improve. However, I should highlight that measuring the position of main text strings may cause problems with rotated pages. This should be looked into.
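
One possible way to approach the rotated-page concern is to map every bounding box into the displayed orientation before comparing positions. This is a sketch under stated assumptions, not what the commit does: the function names are hypothetical, coordinates are assumed to use a bottom-left origin with y pointing up, the boxes are assumed to be given in the unrotated page space, and the rotation is assumed to be a known clockwise multiple of 90 degrees.

```python
def rotate_point(x, y, rotation, width, height):
    """Map a point on an unrotated page (width x height) to its position
    after the page is rotated clockwise by `rotation` degrees."""
    rotation %= 360
    if rotation == 0:
        return x, y
    if rotation == 90:
        return y, width - x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return height - y, x
    raise ValueError("rotation must be a multiple of 90 degrees")


def rotate_box(box, rotation, width, height):
    """Rotate a bounding box (x0, y0, x1, y1) and re-normalize its corners."""
    x0, y0, x1, y1 = box
    xa, ya = rotate_point(x0, y0, rotation, width, height)
    xb, yb = rotate_point(x1, y1, rotation, width, height)
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))
```

After this normalization, "above" and "below" comparisons refer to the page as a reader sees it. Whether the extraction tool already reports coordinates in the displayed orientation would need to be checked first.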