Closed: redshiftzero closed this pull request 8 years ago.
These new commits implement issue #25, also integrating the crawler with the database.

These are the new config options for the crawler:

- `use_database`: select crawler onions from the database instead of using the pickle file, and insert traces back into the database. If `False`, traces will be saved in text files as before.
- `hs_history_lookback`: define the timespan from which to pick the onions to crawl (e.g., those HSes seen up within the last hour, day, or week). Helper functions and corresponding tests are added to `utils.py` and `test/test_utils.py` to handle this.
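As a rough illustration of the lookback idea, here is a minimal sketch of how a timespan string could be turned into a cutoff timestamp. The function names and the `"1h"`/`"2d"`/`"1w"` string format are assumptions for illustration; the real helpers in `utils.py` may differ.

```python
from datetime import datetime, timedelta

# Hypothetical unit suffixes; the real config format may differ.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}

def get_lookback(timespan):
    """Convert a timespan string like '1h', '2d', or '1w' to a timedelta."""
    value, unit = int(timespan[:-1]), timespan[-1]
    return timedelta(**{_UNITS[unit]: value})

def lookback_cutoff(timespan, now=None):
    """Earliest 'last seen' timestamp an onion may have to be picked for crawling."""
    now = now or datetime.utcnow()
    return now - get_lookback(timespan)
```

With a sketch like this, the crawler would select only onions whose last-seen timestamp is newer than `lookback_cutoff(hs_history_lookback)`.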
What happens in `crawler.py` when `use_database` is set to `True`:

- In `Crawler.__init__`, the crawler inserts a row into `raw.crawls` storing some general information about the crawl, such as the OS and entry node used. The database insert is done by `RawStorage.add_crawl()` in `database.py`.
- In `Crawler.collect_onion_trace()`, the crawler inserts a row corresponding to a given trace into `frontpage_examples` and then bulk-inserts each measured cell into `frontpage_traces`.
- Note that `class_data` will still be an `OrderedDict` (to maintain the functionality of the pickle file). However, it will now contain two dicts for the non-monitored and monitored classes. For comparison, when the crawler uses a pickle file, `class_data` is an `OrderedDict` containing two sets for the non-monitored and monitored classes. The logic in `crawl_monitored_nonmonitored_classes` is changed slightly to handle these two cases.
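To make the two `class_data` shapes concrete, here is a sketch, with made-up onion addresses and database ids, of what each case might look like and how one loop can handle both (this is an illustration of the idea, not the actual code in `crawler.py`):

```python
from collections import OrderedDict

# Pickle-file case: two *sets* of onion addresses.
class_data_pickle = OrderedDict([
    ("nonmonitored", {"aaaa.onion", "bbbb.onion"}),
    ("monitored", {"cccc.onion"}),
])

# Database case: two *dicts*, here mapping each onion to a hypothetical
# database id (the real rows may carry other fields).
class_data_db = OrderedDict([
    ("nonmonitored", {"aaaa.onion": 1, "bbbb.onion": 2}),
    ("monitored", {"cccc.onion": 3}),
])

def iter_onions(class_data):
    """Yield (class, onion) pairs regardless of which shape we were given."""
    for class_name, members in class_data.items():
        # Iterating a dict yields its keys, so the same loop covers sets and dicts.
        for onion in members:
            yield class_name, onion
```

In the dict case the crawler can additionally look up the id for each onion when inserting traces, which the set case cannot provide.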
Can you rebase this on master? Also, I see you added `psycopg2` to `sorter-requirements.in`, but don't import it anywhere?
Ah, `psycopg2` is the Python DBAPI driver for PostgreSQL that SQLAlchemy needs to work with the database.
Okay, so it's listed as an optional rather than a hard requirement of the SQLAlchemy package, and while we don't use it directly, we can't communicate w/ the Postgres DB unless we have it? Am I getting that right?
yep, that's correct
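For context, SQLAlchemy selects the DBAPI driver from the database URL's dialect suffix, which is why `psycopg2` must be installed even though the project never imports it. A small sketch with placeholder credentials (the real connection values come from the deployment's configuration):

```python
from sqlalchemy.engine.url import make_url

# Placeholder credentials for illustration only.
url = make_url("postgresql+psycopg2://user:password@localhost:5432/fpsd")

# The "+psycopg2" suffix names the DBAPI module SQLAlchemy imports when a
# connection is made; create_engine(url) would fail with an ImportError
# if psycopg2 were missing from the environment.
print(url.drivername)
```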
Rebased on master for review
I just automated the setup of the database for testing. Can you try to run it for me and then re-provision as well, @redshiftzero?

```
vagrant destroy -f
vagrant up
vagrant provision
```

Be sure to pull first, and also follow the instructions I just added to the README on setting up a virtualenv from which to run the `vagrant` commands, as you need specific Python and Ansible versions for everything to work.
I followed these instructions and re-provisioned: the database seems to be getting set up with no problems in the VM 🎆
Notes on individual commits:
Some other notes:
I meant to figure out a solution to this before pushing, but in my last commit here I realized that if you try to re-provision a VM after a reboot, `/tmp/passwordfile` will be gone, so it will be re-generated: Ansible will generate a new password, which will get saved to `/var/lib/postgresql/pgpass.conf`. That password won't work and will have overwritten the working one. Rather than worry about storing the password permanently on the controller, which would become too complicated given the possibility of deploying to multiple VPSs, we should use a `when` statement to avoid re-generating a password when `pgpass.conf` is already present.
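A sketch of what that `when` guard could look like in the playbook (the task names are illustrative; only the `pgpass.conf` path comes from the discussion above):

```yaml
# Only generate a fresh database password when pgpass.conf does not
# already exist, so re-provisioning keeps the working credentials.
- name: Check for an existing pgpass.conf
  stat:
    path: /var/lib/postgresql/pgpass.conf
  register: pgpass

- name: Generate database password
  # ... existing password-generation task ...
  when: not pgpass.stat.exists
```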
Another thing to figure out is how to support provisioning a VM that doesn't create its own local SQL server, but is instead configured to write to a remote database.
Thanks for the feedback - responding bullet-by-bullet:
> 3d94ceb should be its own PR.

I see that you merged in 3d94ceb separately, so we are good there.
> a1324ec should either be removed or they should be deleted and then melded into d340732 using fixup during a rebase.

I'm assuming that the database files in a1324ec should be removed because they've now been committed under `roles/database/files`. I've removed the entire `db` directory since this is being done by Ansible. I can add back creating the `features` schema at a later time.
> Not sure if a7d6ae2 is needed. Maybe we can just add some info on setting env vars in the README, or just Ansible-ize it?

I see your point here; I've removed this file and rewritten `RawStorage().__init__()` to use the environment variables directly. I added a note to README.md that these environment variables must be set in order to use the database.
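Reading the connection settings from environment variables might look roughly like this; the variable names here are placeholders, not necessarily the ones README.md documents:

```python
import os

def database_url_from_env(env=os.environ):
    """Assemble a SQLAlchemy/psycopg2 URL from environment variables.

    The FPSD_* names are illustrative; raises KeyError if any is unset,
    which surfaces a misconfigured environment early.
    """
    return ("postgresql+psycopg2://"
            "{FPSD_DBUSER}:{FPSD_DBPASS}@{FPSD_DBHOST}/{FPSD_DBNAME}").format(**env)
```

`RawStorage.__init__` could then pass this URL straight to `sqlalchemy.create_engine`.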
> I just barely modified a7dbe9b (now 3ee61fa) in a rebase.

Looks good.
> Something unrelated-ish snuck into c17c617. I think it's okay to leave in, but it would be neater as its own commit or melded into a1324ec.

Fixed.
> We should write more in the README on the database and include that dank Graphviz graph you made.

I wrote a little more in the README describing the database and put in the Graphviz figure. I hope it is “dank” 🏄
So `/etc/profile` and `/etc/profile.d/*` are only sourced by login shells, which means you need to do `sudo -Es && su postgres` to keep the env vars if you need to operate as the postgres user with the way things are set up currently. Since we shouldn't need to support other shells, it's probably better that these are set in `/etc/bash.bashrc` so that they're always accessible to all bash users. Since there is already stuff in there, though, and we want to use a template for the file, it might make sense to leave the script where it is and then use `lineinfile` to ensure the line `source /etc/profile.d/fpsd-database.sh` is present in `/etc/bash.bashrc`. Nevermind, I just thought of a way you can do this just using `/etc/profile.d/fpsd-database.sh`.
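For reference, a `profile.d` drop-in of this sort is just a small script of exports. A sketch of what `/etc/profile.d/fpsd-database.sh` might contain, with placeholder variable names and values:

```shell
# /etc/profile.d/fpsd-database.sh -- sourced by login shells.
# Variable names and values below are illustrative placeholders.
export FPSD_DBHOST="localhost"
export FPSD_DBNAME="fpsd"
export FPSD_DBUSER="fpsd"
# The password itself is read from pgpass.conf rather than exported here.
```

Because non-login shells skip `/etc/profile.d/*`, any process that needs these variables has to start from a login shell (hence the `sudo -Es && su postgres` dance above).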
Things to change: just the tests written for `get_lookback()`.
Implements #26. Adds options in `config.ini` to upload the onions into the database when the sorter runs. One can also set the `upload_to_db` option to `False` to run the sorter as before, without uploading the data into the database (which one should do when testing things). Either way, the onions will also be saved in `logging` in a pickle file.

This PR also adds SQL scripts in `fpsd/db` to set up the database. There is a Graphviz diagram showing the database design in `doc/images`.
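Putting the options from both PRs together, a `config.ini` fragment might look like the following; the section names and the lookback value syntax are assumptions for illustration:

```ini
# Illustrative fragment; only the option names come from the PR descriptions.
[sorter]
# Set to False when testing to skip the database upload.
upload_to_db = True

[crawler]
# Read onions from the database instead of the pickle file.
use_database = True
# Timespan of HS history to draw crawl targets from (value syntax assumed).
hs_history_lookback = 1w
```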