fcfort / betterment-csv-chrome

Betterment CSV export Chrome extension
https://chrome.google.com/webstore/detail/betterment-pdf-to-csv-exp/jbneodpofmnammepmnejgkacdbjojcgn
Apache License 2.0

Can stream/lattice-based PDF table extractor be used instead? #58

Open fcfort opened 5 years ago

fcfort commented 5 years ago

This project currently uses a homegrown, hacky line-by-line PDF-to-text conversion to extract transaction data from Betterment PDFs. It would be much better if we could use a more robust library designed to extract tabular data from PDFs.

See https://tomassetti.me/how-to-convert-a-pdf-to-excel/ for background, along with a Python implementation (https://github.com/camelot-dev/camelot) and a Java implementation (https://github.com/tabulapdf/tabula-java).

One small problem is that there is no JS implementation. One possibility is to use a Java to JS transpiler, e.g. https://github.com/cincheo/jsweet or https://github.com/google/j2cl.
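
To make the "stream" idea concrete, here is a minimal JS sketch of whitespace-based table extraction: group positioned text fragments into rows by their y coordinate, then infer columns from clustered left edges. This is an illustration only, not how tabula-java or camelot implement it. The input shape (`{ text, x, y }`), the function names, and the tolerance defaults are all assumptions made for the example, and it assumes left-aligned columns.

```javascript
// Hypothetical input: text fragments with page coordinates, e.g.
// { text: 'Deposit', x: 81, y: 120 }. This shape is an assumption.

// Group fragments whose y values lie within yTolerance of the row's first fragment.
function groupIntoRows(items, yTolerance = 2) {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(item.y - last.y) <= yTolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  // Within each row, order fragments left to right.
  for (const row of rows) row.items.sort((a, b) => a.x - b.x);
  return rows;
}

// Cluster the left edges of all fragments; each cluster start is one column.
function findColumnStarts(rows, xTolerance = 5) {
  const xs = rows.flatMap(r => r.items.map(i => i.x)).sort((a, b) => a - b);
  const starts = [];
  for (const x of xs) {
    if (starts.length === 0 || x - starts[starts.length - 1] > xTolerance) {
      starts.push(x);
    }
  }
  return starts;
}

// Return the table as an array of rows, each an array of cell strings.
function extractTable(items, opts = {}) {
  const rows = groupIntoRows(items, opts.yTolerance);
  const starts = findColumnStarts(rows, opts.xTolerance);
  return rows.map(row => {
    const cells = new Array(starts.length).fill('');
    for (const item of row.items) {
      // Assign the fragment to the column whose start is closest to its x.
      let col = 0;
      for (let c = 1; c < starts.length; c++) {
        if (Math.abs(item.x - starts[c]) < Math.abs(item.x - starts[col])) col = c;
      }
      cells[col] = cells[col] ? cells[col] + ' ' + item.text : item.text;
    }
    return cells;
  });
}

// Made-up sample data, not taken from a real Betterment PDF.
const sampleItems = [
  { text: 'Date', x: 10, y: 100 },
  { text: 'Description', x: 80, y: 100.5 },
  { text: 'Amount', x: 200, y: 100 },
  { text: 'Deposit', x: 81, y: 120 },
  { text: 'Jan 2', x: 10, y: 120.4 },
  { text: '$50.00', x: 201, y: 119.8 },
];
console.log(extractTable(sampleItems));
// [ [ 'Date', 'Description', 'Amount' ], [ 'Jan 2', 'Deposit', '$50.00' ] ]
```

Right-aligned amount columns, wrapped descriptions, and multi-page tables would all break this naive version, which is the main argument for reusing a library that already handles them.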

fcfort commented 5 years ago

Next steps:

  1. Implement a prototype in Java that uses tabula-java to take a Betterment PDF and extract its transaction tables.
  2. Implement a prototype that calls the tabula-java lib from the Chrome extension in the browser, i.e. test the Java lib -> JS -> Browserify -> Chrome extension pathway.
  3. Implement a separate Java lib to do the table extraction.
  4. Import the lib from the previous step, transpile it, and include it in the Chrome extension app.
  5. Migrate the app over to the new table extraction lib.
  6. Deprecate old way of extracting data.
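
For steps 5 and 6, one possible way to migrate gradually is to run the new extractor first and fall back to the old line-by-line path when it fails or finds nothing. The sketch below is only a suggestion. `makeExtractor`, `tableExtract`, `legacyExtract`, and `onFallback` are hypothetical names and do not exist in the project.

```javascript
// Hypothetical wrapper: prefer the new table-based extractor and fall back
// to the legacy line-by-line extractor. All names are placeholders.
function makeExtractor({ tableExtract, legacyExtract, onFallback = () => {} }) {
  return function extract(pdfInput) {
    try {
      const rows = tableExtract(pdfInput);
      if (Array.isArray(rows) && rows.length > 0) {
        return { rows, source: 'table' };
      }
      onFallback(new Error('table extractor returned no rows'));
    } catch (err) {
      // Report the failure so fallbacks can be counted before deprecation.
      onFallback(err);
    }
    return { rows: legacyExtract(pdfInput), source: 'legacy' };
  };
}
```

Once `onFallback` stops firing on real statements, the legacy path can be removed with some confidence.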