dfo-mar-odis / saraDataScraping

Repo to hold code and project management for the SARA data scraping project
MIT License

Review technology options - SARAdatascraping #5

Closed stoyelq closed 2 years ago

stoyelq commented 2 years ago

Proposed Change/Activity

Look into the packages and programming languages available for parsing data out of files, and evaluate their suitability for this workflow. Possible candidates include the Python docx library and the R libraries docxtractr and R-crawler.
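As a baseline to compare the candidate libraries against, here is a minimal sketch that pulls paragraph text out of a Word file using only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The function name `docx_paragraphs` is made up for illustration, and the sketch ignores table structure, headers, footers and footnotes, so a dedicated library would likely still be needed for real SARA documents.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as a list of strings.

    Hypothetical helper for evaluation purposes only: it reads the main
    document body and skips headers, footers and footnotes.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Running each candidate library on the same sample documents and comparing its output with this baseline would show what the library adds, for example table extraction.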

Why is this important

There are multiple ways to accomplish the desired data scraping, so reviewing the options up front should lead to a better end product.

Additional Context

Any solution will likely need to work for both PDF and Word documents.
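One way to keep the PDF and Word requirement independent of the library choice is to route each file to a format-specific extractor by its extension. The sketch below is only an illustration of that idea: `EXTRACTORS`, `register`, `extract_docx`, `extract_pdf` and `extract_text` are hypothetical names, and the two extractors are placeholders to be filled in once a parser has been chosen for each format.

```python
from pathlib import Path

# Hypothetical registry mapping a lowercase file suffix to an extractor function.
EXTRACTORS = {}


def register(suffix):
    """Decorator that registers an extractor for the given file suffix."""
    def decorator(func):
        EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".docx")
def extract_docx(path):
    # Placeholder: call the Word parser selected by this review.
    raise NotImplementedError("no Word parser selected yet")


@register(".pdf")
def extract_pdf(path):
    # Placeholder: call the PDF parser selected by this review.
    raise NotImplementedError("no PDF parser selected yet")


def extract_text(path):
    """Dispatch to the extractor registered for the file's suffix."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return extractor(path)
```

With this layout the rest of the workflow only calls `extract_text`, so swapping one parser for another touches a single function.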