VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

PDF Table Extraction #241

Open okennedy opened 1 year ago

okennedy commented 1 year ago

What pain point is this feature intended to address? Please describe.

Data often, irritatingly, lives in PDF files.

Describe the solution you'd like

Proposed workflow:

  1. Drag the PDF onto a cell or enter a URL.
  2. The PDF is uploaded or downloaded as a Vizier file (created by the cell itself), and the cell enters a selection UI (e.g., by embedding the PDF).
  3. The user navigates to a page of the PDF in the cell (optional).
  4. The user selects an area of the PDF (optional).
  5. The user names the table (optional).
  6. The user repeats steps 3-5 for additional tables.
  7. The user clicks run.
  8. The workflow uses something like tabula to extract tables.
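
To make the last step concrete, here is a rough sketch of how the cell's selections might be turned into tabula invocations. It is only an illustration, not a proposed implementation: `TableSelection`, `table_name`, `tabula_command`, the `tabula.jar` path, and `report.pdf` are all hypothetical names, and the sketch assumes the tabula-java command-line tool with its `--pages`, `--area` (top,left,bottom,right), `--format`, and `--outfile` options. It only builds the argument lists and does not run anything.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableSelection:
    """One table marked in the selection UI (hypothetical type).

    Every field is optional, mirroring the optional steps in the workflow.
    """
    page: Optional[int] = None  # 1-based page number; None means all pages
    area: Optional[Tuple[float, float, float, float]] = None  # top, left, bottom, right
    name: Optional[str] = None  # user-chosen table name


def table_name(selection: TableSelection, index: int) -> str:
    """Fall back to a positional name when the user skipped naming the table."""
    return selection.name or f"table_{index + 1}"


def tabula_command(pdf_path: str, selection: TableSelection, index: int,
                   jar_path: str = "tabula.jar") -> List[str]:
    """Build the argument list for one tabula-java run (nothing is executed)."""
    cmd = ["java", "-jar", jar_path, "--format", "CSV"]
    # No page selected: ask tabula to scan every page.
    cmd += ["--pages", str(selection.page) if selection.page is not None else "all"]
    # No area selected: leave the region unrestricted.
    if selection.area is not None:
        cmd += ["--area", ",".join(str(v) for v in selection.area)]
    cmd += ["--outfile", f"{table_name(selection, index)}.csv", pdf_path]
    return cmd


# Example: one fully specified table and one where every optional step was skipped.
selections = [
    TableSelection(page=3, area=(100.0, 50.0, 300.0, 500.0), name="budget"),
    TableSelection(),
]
commands = [tabula_command("report.pdf", sel, i) for i, sel in enumerate(selections)]
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the page, area, and name together in one record per table would also give the cell a natural place to store what was extracted from where, which is the provenance link missing from the alternatives.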

Describe alternatives you've considered

Several commercial tools provide this sort of extraction, or tabula can be used as a command-line... but in both cases there is a non-provenance-tracked separation between the data source and the data obtained from it.