UW-COSMOS / Cosmos

Knowledge base construction from raw scientific documents

Table extraction improvements with pdfplumber #175

Closed ryansun117 closed 1 year ago

ryansun117 commented 1 year ago

Added pdfplumber to table_extraction.py while preserving the existing Camelot table extraction functionality.

The pdfplumber coordinate system should be the same as COSMOS's PDF coordinate system, except that the coordinates need to be scaled from the size of the page image (in pixels) to the size of the PDF page, which is probably around 792x612.
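The scaling described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `scale_bbox_to_pdf`, its parameters, and the default page size of 612x792 (width x height, US Letter in PDF points) are assumptions. It also assumes both coordinate systems put the origin at the top-left corner and order boxes as `(x0, top, x1, bottom)`, which is the order pdfplumber uses.

```python
from typing import Tuple

# (x0, top, x1, bottom), origin at the top-left corner of the page
BBox = Tuple[float, float, float, float]


def scale_bbox_to_pdf(
    bbox: BBox,
    image_size: Tuple[float, float],
    page_size: Tuple[float, float] = (612.0, 792.0),
) -> BBox:
    """Scale a bounding box from page-image pixels to PDF page units.

    Hypothetical helper: image_size is (width, height) of the page image,
    page_size is (width, height) of the PDF page. The default page size is
    an assumption (US Letter); prefer the real page dimensions when known.
    """
    x0, top, x1, bottom = bbox
    img_w, img_h = image_size
    page_w, page_h = page_size
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")

    # Independent scale factors per axis, in case the aspect ratios differ.
    sx = page_w / img_w
    sy = page_h / img_h
    return (x0 * sx, top * sy, x1 * sx, bottom * sy)
```

With pdfplumber, the page dimensions can be read from `page.width` and `page.height` rather than assumed, and the scaled box can be passed to `page.crop(bbox)` before calling `extract_table()` on the cropped page.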

Tested the added functions locally with manually entered PDF coordinates, but not within COSMOS, since TableLocationProcessor requires COSMOS-generated PNGs of detected tables.

iross commented 1 year ago

One more small request: could you add pdfplumber to requirements.txt?