Populate raw schema in development VM

redshiftzero commented 7 years ago

This PR adds population of the raw schema to the Ansible playbook that provisions the development VM. This lets the feature generation and machine learning code run more easily in the development VM, i.e. without having to run the crawler (which can take a while) or connect to the production database (A Bad Idea). The data that populates the raw schema in the VM is derived from our real data (with some anonymization). I have also added the notebook where I construct this dataset, for future reference / modification.

Upon request, I have also combined the data used to populate the individual tables into a single file, roles/crawler/files/raw-data/test_data.csv, so that people can play with it without needing to worry about joins.
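
For anyone who wants a quick look at that file before loading it into anything heavier, here is a minimal sketch using only the Python standard library. It assumes the CSV has a header row (the column layout is not described in this PR), and the helper name `summarize_csv` is hypothetical, not something in the repo.

```python
import csv
from pathlib import Path

# Path named in the PR description, relative to the repository root.
TEST_DATA = Path("roles/crawler/files/raw-data/test_data.csv")


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file.

    Assumption: the first line of the file is a header row.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Count data rows; the header line is consumed by DictReader.
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to [].
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    if TEST_DATA.exists():
        columns, row_count = summarize_csv(TEST_DATA)
        print(f"{row_count} rows, columns: {columns}")
    else:
        print(f"{TEST_DATA} not found; run this from the repository root")
```

Since everything is already in one flat file, no joins are needed to get a per-row view; the sketch only reports the header and the number of rows.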

This PR also bumps the version of Tor Browser, since the download link in our Ansible play was out of date and 404ing.

coveralls commented 7 years ago

Coverage remained the same at 72.727% when pulling 5a73245e2381709b005ca0a6bd6fabe17f647062 on populate-raw-schema-in-vm into b183c0c623763b1c244b5617f126ba1be7a4bd53 on master.

coveralls commented 7 years ago

Changes Unknown when pulling d8c2c2d5feff62b1a7ce714d29959df51f29b2e6 on populate-raw-schema-in-vm into on master.

freedomofpress / fingerprint-securedrop

Populate raw schema in development VM #89