eebbesen / hbcb_rails

Cool data at, but it is stuck in pdfs
MIT License



This program downloads the pdfs, converts them to text documents and slurps the information into a searchable database.

Data users

I have scrubbed data manually and via regex so that it can be used as a database.

Setup application

bundle install                # install gem dependencies
bundle exec rake db:migrate   # create database tables
bundle exec rake db:data:load # load data
bundle exec rails s           # run rails server

Then use a web browser to visit http://localhost:3000

Persist manual changes to data

If you find data that needs better or different formatting, fix it in the application, then dump the data so the changes are saved:

bundle exec rake db:data:dump


To run conversions

Download pdfs

gem install nokogiri
ruby lib/download_pdfs.rb <start_letter>

The start_letter argument is optional. Include it if you've already partially downloaded the files.

You can simulate a run without downloading any files by setting the DRY_RUN environment variable, e.g.

DRY_RUN=1 ruby lib/download_pdfs.rb <start_letter>
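
For anyone wondering how a switch like DRY_RUN can be wired up, here is a minimal sketch. It is not the code in lib/download_pdfs.rb: the dry_run? and download helpers are hypothetical names, and treating any non-empty value as "on" is an assumption.

```ruby
# Hypothetical sketch of a DRY_RUN guard; not the code in lib/download_pdfs.rb.

# True when DRY_RUN is set to any non-empty value (assumed convention).
def dry_run?(env = ENV)
  !env.fetch('DRY_RUN', '').empty?
end

# Runs the given block to perform the download unless this is a dry run.
# On a dry run it only reports what would have been downloaded.
# Returns true if the block ran, false otherwise.
def download(url, dest, env: ENV, out: $stdout)
  if dry_run?(env)
    out.puts "DRY RUN: would download #{url} to #{dest}"
    return false
  end
  yield url, dest
  true
end
```

The block passed to download holds the real network and file work, so the dry-run path never touches either.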

Convert pdfs to text

brew update
brew install xpdf
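
xpdf ships a pdftotext command line tool, which is what does the actual conversion. The sketch below shows one way to run it over a directory of pdfs. It is an assumption-heavy example, not a script from this project: the helper names are hypothetical, the -layout flag is my choice, and writing to test/fixtures is only a guess based on where the slurp task reads from.

```ruby
# Hypothetical sketch of batch conversion with pdftotext; not part of this project.

# Maps a pdf path to the text file that should be generated for it.
def text_path_for(pdf_path, out_dir)
  File.join(out_dir, File.basename(pdf_path, '.pdf') + '.txt')
end

# Builds the pdftotext argument list for one pdf.
# -layout asks pdftotext to keep the physical layout of the page (assumed choice).
def pdftotext_command(pdf_path, out_dir)
  ['pdftotext', '-layout', pdf_path, text_path_for(pdf_path, out_dir)]
end

# Converts every pdf in pdf_dir and returns [pdf, result] pairs, where result
# is what Kernel#system returned (true, false, or nil if the tool is missing).
def convert_all(pdf_dir, out_dir)
  Dir.glob(File.join(pdf_dir, '*.pdf')).sort.map do |pdf|
    [pdf, system(*pdftotext_command(pdf, out_dir))]
  end
end
```

Calling convert_all('pdfs', 'test/fixtures') would then write one .txt file per pdf, assuming the downloaded pdfs live in a pdfs directory.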

Parse and persist bios and postings

To process all of the files matching test/fixtures/*.txt:

bundle exec rake slurp

To process a single file:

bundle exec rake slurp[/absolute/path/to/project/test/fixtures/adan_charles.txt]
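
The two invocations above boil down to "use the given file, or fall back to every fixture". A rough sketch of that selection logic follows. It is not the actual rake task: files_to_slurp is a hypothetical helper, and raising on a missing file is an assumption.

```ruby
# Hypothetical sketch of how a slurp-style task could choose its input files.

# With no path, returns every file matching the fixtures glob, sorted.
# With a path, returns just that file, or raises if it does not exist.
def files_to_slurp(path = nil, fixtures_glob = 'test/fixtures/*.txt')
  return Dir.glob(fixtures_glob).sort if path.nil? || path.empty?
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)
  [path]
end
```

If your shell is zsh, quote the task and its argument (for example 'slurp[/absolute/path/to/file.txt]'), because zsh treats unquoted square brackets as a glob pattern.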


The value of this code (to me) is a working datastore, so development is proceeding 'fast and loose'. One of the many compromises is that I manually clean up data that the default (already ugly, imho) regular expressions don't parse or convert properly.