There is cool data at https://www.gov.mb.ca/chc/archives/hbca/biographical/index.html, but it is stuck in PDFs.
This program downloads the PDFs, converts them to text documents, and slurps the information into a searchable database.
I have scrubbed the data manually and via regex so that it can be used as a database.
bundle install # initialize app
bundle exec rake db:migrate # create database tables
bundle exec rake db:data:load # load data
bundle exec rails s # run rails server
Then use a web browser to visit http://localhost:3000
If you find data that needs to be formatted better or differently, edit it in the application, then dump the changes so they are saved:
bundle exec rake db:data:dump
gem install nokogiri
ruby lib/download_pdfs.rb <start_letter>
You can include a start_letter if you've already partially downloaded the files.
You can simulate a run that doesn't download any files by setting the environment variable DRY_RUN, e.g.
DRY_RUN=1 ruby lib/download_pdfs.rb <start_letter>
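The start_letter and DRY_RUN behaviour described above could be sketched roughly as follows. This is not the contents of lib/download_pdfs.rb; the method names (letters_from, dry_run?, each_letter_to_download) are hypothetical, and it assumes the index is walked one letter at a time and that any non-empty DRY_RUN value enables the dry run.

```ruby
# Hypothetical sketch; not the actual contents of lib/download_pdfs.rb.

# Letters still to be processed, beginning at start_letter (default "a").
def letters_from(start_letter = nil)
  first = (start_letter || "a").downcase
  unless first.match?(/\A[a-z]\z/)
    raise ArgumentError, "expected a single letter a-z, got #{start_letter.inspect}"
  end
  (first.."z").to_a
end

# Assumption: DRY_RUN counts as set when it holds any non-empty value.
def dry_run?(env = ENV)
  !env.fetch("DRY_RUN", "").empty?
end

# Yields each remaining letter to the block, or only prints what would be
# done when this is a dry run. Returns the letters that were considered.
def each_letter_to_download(start_letter, env = ENV)
  letters = letters_from(start_letter)
  letters.each do |letter|
    if dry_run?(env)
      puts "DRY RUN: would download PDFs for letter #{letter}"
    else
      yield letter
    end
  end
  letters
end
```

A caller would put the real download step in the block, e.g. `each_letter_to_download(ARGV[0]) { |letter| ... }`, so a dry run exercises the same loop without touching the network.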
brew update
brew install xpdf
lib/pdf_to_text.sh
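For readers who prefer Ruby to shell, the conversion step could look something like the sketch below. It is an illustration only, not a port of lib/pdf_to_text.sh: the method names and the directory arguments are hypothetical, and it assumes the pdftotext binary installed with xpdf is on the PATH and is called as `pdftotext input.pdf output.txt`.

```ruby
# Hypothetical sketch; lib/pdf_to_text.sh is the script the project uses.

# Build the pdftotext invocation for one PDF, writing NAME.txt into out_dir.
def pdftotext_command(pdf_path, out_dir)
  txt_path = File.join(out_dir, File.basename(pdf_path, ".pdf") + ".txt")
  ["pdftotext", pdf_path, txt_path]
end

# Convert every PDF in pdf_dir; returns the commands that failed.
def convert_all(pdf_dir, out_dir)
  Dir.glob(File.join(pdf_dir, "*.pdf")).sort.reject do |pdf|
    system(*pdftotext_command(pdf, out_dir))
  end
end
```

Passing the command as an array to `system` avoids shell quoting problems with file names that contain spaces.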
To process all of the files matching test/fixtures/*.txt:
bundle exec rake slurp
To process a single file:
bundle exec rake slurp[/absolute/path/to/project/test/fixtures/adan_charles.txt]
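The two invocations above differ only in whether a path is passed in brackets. A minimal sketch of how that optional argument could be resolved to a list of files is shown here; files_to_slurp is a hypothetical helper, not the project's actual task code. In a Rakefile, a task declared as `task :slurp, [:path]` receives the bracketed value as `args[:path]`, which would be nil for a plain `rake slurp`.

```ruby
# Hypothetical helper; the real slurp task may be organised differently.

# With no path, return every fixture matching the glob (sorted for a stable
# order). With a path, return just that file, failing early if it is missing.
def files_to_slurp(path = nil, fixtures_glob = "test/fixtures/*.txt")
  return Dir.glob(fixtures_glob).sort if path.nil? || path.empty?
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)
  [path]
end
```

Checking the single path up front gives a clear error for a mistyped file name instead of a silent no-op.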
The value of this code (to me) is a working datastore, so development is proceeding 'fast and loose'. One of the many compromises is manually cleaning up data that the default (already ugly, imho) regular expressions don't parse or convert properly.