Open Evildoor opened 7 years ago
Note: it seems that the term "caption", rather than "description" or "header", is the one most often used.
Some work was done on this (see 5e149e75d48280754bf8e31f916f8970c1d17415). As usual, there is much to improve. However, I should highlight that measuring the position of main text strings may cause problems with rotated pages. This should be looked into.
PDF Analyzer's table processing algorithm includes detecting the table description and separating table lines from all other lines. These procedures work on the assumption that the table description is positioned below the table. However, some documents position descriptions above tables, or even mix both kinds of positioning. PDF Analyzer either fails to extract such tables or extracts them incorrectly.
Document examples: CDS_CERN-ATL-COM-PHYS-2016-135, page 13.
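One possible way to handle both positions is sketched below. This is not PDF Analyzer's actual code: the `Line` class, the `CAPTION_RE` pattern, and both function names are hypothetical, and the heuristic (the table lies on the side of the caption with the smaller vertical gap) is only an assumption. Coordinates are assumed to be measured downward from the top of an unrotated page, so rotated pages would need to be normalized first.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical caption pattern: "Table 3", "Tab. 3", etc.
CAPTION_RE = re.compile(r"^\s*(Table|Tab\.)\s+\d+", re.IGNORECASE)


@dataclass
class Line:
    text: str
    top: float     # distance from page top to the line's upper edge
    bottom: float  # distance from page top to the line's lower edge


def find_caption(lines: List[Line]) -> Optional[int]:
    """Return the index of the first line that looks like a caption."""
    for i, line in enumerate(lines):
        if CAPTION_RE.match(line.text):
            return i
    return None


def caption_position(lines: List[Line], idx: int) -> str:
    """Guess whether the caption at lines[idx] sits 'above' or 'below' its table.

    Assumes `lines` is sorted top to bottom and that the table is on the
    side of the caption with the smaller vertical gap.
    """
    inf = float("inf")
    gap_before = lines[idx].top - lines[idx - 1].bottom if idx > 0 else inf
    gap_after = (
        lines[idx + 1].top - lines[idx].bottom if idx + 1 < len(lines) else inf
    )
    # Ties fall back to 'below', which matches the current assumption.
    return "below" if gap_before <= gap_after else "above"


caption_above = [
    Line("Table 1: Results", 100, 110),
    Line("a b c", 114, 124),
    Line("1 2 3", 126, 136),
    Line("Some paragraph", 160, 170),
]
caption_below = [
    Line("Intro text", 50, 60),
    Line("a b c", 100, 110),
    Line("1 2 3", 112, 122),
    Line("Table 2: Other results", 126, 136),
    Line("Next paragraph", 170, 180),
]
```

With the position known, the line separation step could scan downward from the caption when the result is "above" and upward when it is "below", instead of always scanning upward.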