freedomofpress / fingerprint-securedrop

A machine learning data analysis pipeline for analyzing website fingerprinting attacks and defenses.
GNU Affero General Public License v3.0

Sorter database integration #27

Closed redshiftzero closed 8 years ago

redshiftzero commented 8 years ago

Implements #26. Adds options in `config.ini` to upload the onions into the database when the sorter runs. One can also set the `upload_to_db` option to false to run the sorter as before, without uploading the data into the database (which one should do when testing). Either way, the onions will also be saved to a pickle file via logging.
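For illustration, reading such an option with `configparser` might look like this (the section name and anything beyond `upload_to_db` are assumptions, not necessarily the actual `config.ini` layout):

```python
from configparser import ConfigParser

# Hypothetical excerpt of config.ini -- only upload_to_db comes from this PR;
# the [sorter] section name is illustrative.
config_text = """
[sorter]
upload_to_db = false
"""

config = ConfigParser()
config.read_string(config_text)

# getboolean() accepts "true"/"false" (case-insensitive), "yes"/"no", "1"/"0".
upload_to_db = config.getboolean("sorter", "upload_to_db")
print(upload_to_db)  # False
```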

This PR also adds SQL scripts in `fpsd/db` to set up the database. There is a Graphviz diagram showing the database design in `doc/images`.

redshiftzero commented 8 years ago

These new commits implement issue #25, also integrating the crawler with the database.

These are the new config options for the crawler:

* `use_database`: select the onions to crawl from the database instead of using the pickle file, and insert traces back into the database. If false, traces will be saved in text files as before.
* `hs_history_lookback`: define the timespan from which to pick the onions to crawl (e.g., those hidden services seen in the last hour, day, or week). Some helper functions and corresponding tests are added to `utils.py` and `test/test_utils.py` to handle this.
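A minimal sketch of what such a lookback helper might look like (the function name, signature, and span keywords are illustrative, not necessarily the actual `utils.py` API):

```python
from datetime import datetime, timedelta

# Map human-friendly lookback spans to timedeltas (illustrative keys).
_LOOKBACKS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}

def lookback_timestamp(span, now=None):
    """Return the earliest timestamp from which onions should be selected."""
    if span not in _LOOKBACKS:
        raise ValueError("unknown lookback span: {}".format(span))
    now = now or datetime.utcnow()
    return now - _LOOKBACKS[span]
```

A query can then select only hidden services whose last-seen timestamp is newer than `lookback_timestamp("day")`, for example.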

What happens in `crawler.py` when `use_database` is set to true:

* In `Crawler.__init__`, the crawler inserts a row into `raw.crawls` storing some general information about the crawl, such as the OS and entry node used. The database insert is done in `RawStorage.add_crawl()` in `database.py`.
* In `Crawler.collect_onion_trace()`, the crawler inserts a row corresponding to a given trace into `frontpage_examples` and then bulk-inserts each measured cell into `frontpage_traces`.
* Note that `class_data` will still be an `OrderedDict` (to maintain the functionality of the pickle file), but it will now contain two dicts for the non-monitored and monitored classes. For comparison, when the crawler uses a pickle file, `class_data` is an `OrderedDict` containing two sets for the non-monitored and monitored classes. The logic in `crawl_monitored_nonmonitored_classes` is changed slightly to handle these two cases.
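To illustrate the two shapes of `class_data` described above (the onion names and the id-mapping are made up for the example):

```python
from collections import OrderedDict

# Pickle-file mode: two *sets* of onion addresses.
class_data_pickle = OrderedDict([
    ("nonmonitored", {"aaa.onion", "bbb.onion"}),
    ("monitored", {"ccc.onion"}),
])

# Database mode: two *dicts*, e.g. mapping onion address to a database id
# (what the values actually hold is an assumption here).
class_data_db = OrderedDict([
    ("nonmonitored", {"aaa.onion": 1, "bbb.onion": 2}),
    ("monitored", {"ccc.onion": 3}),
])

# Code that only needs the onion addresses can treat both shapes uniformly,
# since iterating a dict yields its keys just like iterating a set:
for class_name, members in class_data_db.items():
    onions = sorted(members)
```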

psivesely commented 8 years ago

Can you rebase this on master? Also, I see you added psycopg2 to sorter-requirements.in, but don't import it anywhere?

redshiftzero commented 8 years ago

ah, psycopg2 is the Python DBAPI for PostgreSQL that SQLAlchemy needs in order to talk to the database.

psivesely commented 8 years ago

Okay, so it's listed as an optional and not hard requirement of the SQLAlchemy package, and while we don't use it directly, we can't communicate w/ the Postgres DB unless we have it? Am I getting that right?

redshiftzero commented 8 years ago

yep, that's correct

redshiftzero commented 8 years ago

Rebased on master for review

psivesely commented 8 years ago

I just automated the setup of the database for testing. Can you try to run it for me and then re-provision as well @redshiftzero:

```
vagrant destroy -f
vagrant up
vagrant provision
```

psivesely commented 8 years ago

Be sure to pull first and also follow the instructions I just added to the README on setting up a virtualenv from which to run the vagrant commands, as you need specific Python and Ansible versions for everything to work.

redshiftzero commented 8 years ago

I followed these instructions and re-provisioned: the database seems to be getting set up with no problems in the VM 🎆

psivesely commented 8 years ago

Notes on individual commits:

Some other notes:

psivesely commented 8 years ago

I meant to figure out a solution to this before pushing, but in my last commit here I realized that if you try to re-provision a VM after a reboot, `/tmp/passwordfile` will be gone, so it will be re-generated: Ansible will generate a new password and save it to `/var/lib/postgresql/pgpass.conf`. That password won't work, and it will have overwritten the working one. Rather than worry about storing the password permanently on the controller, which would become too complicated given the possibility of deploying to multiple VPSs, we should use a `when` statement to skip generating a password when `pgpass.conf` is already present.
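Roughly, the `when` guard could look like this (an illustrative sketch only: the task names, paths, and password generator here are assumptions, not the actual role):

```yaml
# Illustrative tasks -- not the actual database role.
- name: Check whether pgpass.conf already exists
  stat:
    path: /var/lib/postgresql/pgpass.conf
  register: pgpass_conf

- name: Generate a new database password only on first provision
  shell: head -c 32 /dev/urandom | base64 > /tmp/passwordfile
  when: not pgpass_conf.stat.exists
```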

Another thing to figure out is how to support provisioning a VM that doesn't create its own local SQL server, but is instead configured to write to a remote database.

redshiftzero commented 8 years ago

Thanks for the feedback - responding bullet-by-bullet:

> 3d94ceb should be its own PR.

I see that you merged in 3d94ceb separately so we are good there.

> a1324ec should either be removed, or its files deleted and melded into d340732 using `fixup` during a rebase.

I’m assuming that the database files in a1324ec should be removed because they’ve now been committed under roles/database/files. I’ve removed the entire db directory since this is being done by Ansible. I can add back creating the features schema at a later time.

> Not sure if a7d6ae2 is needed. Maybe we can just add some info on setting env vars in the README or just Ansible-ize it?

I see your point here. I've removed this file and rewritten `RawStorage.__init__()` to use the environment variables directly. I added a note to README.md that these environment variables must be set in order to use the database.

> I just barely modified a7dbe9b (now 3ee61fa) in a rebase.

Looks good.

> Something unrelated-ish snuck into c17c617. I think it's okay to leave in, but it would be neater as its own commit or melded into a1324ec.

Fixed

> We should write more for the README on the database and include that dank Graphviz graph you made

I wrote a little more on the README describing the database and put in the graphviz figure. I hope it is “dank” 🏄

psivesely commented 8 years ago

So `/etc/profile` and `/etc/profile.d/*` are only sourced by login shells, which means you need to do `sudo -Es && su postgres` to keep the env vars if you want to operate as the postgres user with the way things are set up currently. Since we shouldn't need to support other shells, it's probably better that these are set in `/etc/bash.bashrc` so that they're always accessible to all bash users. Since there is already stuff in there, though, and we want to use a template for the file, it might make sense to leave the `/etc/profile.d/fpsd-database.sh` script where it is and then ensure the line `source /etc/profile.d/fpsd-database.sh` is present in `/etc/bash.bashrc`. Nevermind, I just thought of a way you can do this just using `lineinfile`.
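The `lineinfile` approach might look roughly like this (a sketch, not the actual task; the task name is an assumption):

```yaml
# Illustrative task -- ensures the env-var script is sourced by all bash users.
- name: Source the fpsd database env vars in the system-wide bashrc
  lineinfile:
    dest: /etc/bash.bashrc
    line: "source /etc/profile.d/fpsd-database.sh"
    state: present
```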

redshiftzero commented 8 years ago

Things to change: just the tests written for `get_lookback()`