This PR adds a step to the Ansible playbook that provisions the development VM so that the raw schema is populated. This makes it easier to run the feature generation and machine learning code in the development VM, i.e. without having to run the crawler (which can take a while) or connect to the production database (A Bad Idea). The data that populates the raw schema in the VM is derived from our real data (with some anonymization). I have also added the notebook where I construct this dataset, for future reference / modification.
Upon request, I have also created a single-file version of the data used to populate the individual tables, roles/crawler/files/raw-data/test_data.csv, so that people can play with it without needing to worry about joins.
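For anyone who wants to poke at the single-file version, here is a minimal sketch of reading it with the Python standard library. It assumes the CSV has a header row (I have not listed the columns here, so none are named), and `load_rows` is a hypothetical helper, not something that exists in the repo.

```python
import csv
from pathlib import Path

# Path as given in this PR, relative to the repository root.
TEST_DATA = Path("roles/crawler/files/raw-data/test_data.csv")


def load_rows(path=TEST_DATA):
    """Return the CSV rows as a list of dicts keyed by the header row.

    Assumption: the first line of the file is a header row.
    All values come back as strings; no type conversion is attempted.
    """
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Calling `load_rows()` from the repository root should give one dict per row, and `rows[0].keys()` shows the column names that are actually present.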
This PR also bumps the version of Tor Browser, since the download link in the Ansible play was out of date and was returning a 404.
Coverage remained the same at 72.727% when pulling 5a73245e2381709b005ca0a6bd6fabe17f647062 on populate-raw-schema-in-vm into b183c0c623763b1c244b5617f126ba1be7a4bd53 on master.