freedomofpress / fingerprint-securedrop

A machine learning data analysis pipeline for analyzing website fingerprinting attacks and defenses.
GNU Affero General Public License v3.0

Fixes for crawler and sorter #95

Closed: redshiftzero closed this 7 years ago

redshiftzero commented 7 years ago

Fix a couple of issues with the sorter and crawler:

  1. It looks like the sorter was exiting immediately because 0a6b52538d121e58982c68b7042f90b526e8dd53 changed the delimiter between directories in config.ini to \n, while the sorter still expected a comma (,), so it was no longer parsing any directory URLs.
  2. config.ini was not being populated with the prod db values because Ansible was not setting up config.ini. I made a minor change to read from ~/.pgpass (since Ansible is setting that up on the dev VM and the VPSes). This also closes #93 and closes #94.
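To illustrate the delimiter mismatch in item 1, here is a minimal sketch of a parser that accepts either separator. It is not the code in this PR: the `[sorter]` section, the `onion_dirs` option, and the example URLs are hypothetical names chosen for illustration.

```python
import configparser
import re


def parse_directory_urls(raw_value):
    """Split a config value on commas and/or newlines, dropping blank entries."""
    parts = re.split(r"[,\n]", raw_value)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical config.ini content: section and option names are assumptions.
SAMPLE_CONFIG = """
[sorter]
onion_dirs =
    http://dir-one.example/list
    http://dir-two.example/list
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# configparser joins the indented continuation lines with "\n",
# so a comma-only split would return the whole value as one entry.
urls = parse_directory_urls(config["sorter"]["onion_dirs"])
print(urls)
```

Accepting both separators means a later change to the config.ini layout would not silently leave the sorter with an empty URL list.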

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 72.727% when pulling 01e2f93c1adbd720cb440208178e51fd5f15441c on hotfix-sorter-crawler into d3859760d1ec66d6729565b9870a240006d37361 on master.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 72.727% when pulling 17ecf152171c6ed44ebee5be6be408da0c74f143 on hotfix-sorter-crawler into d3859760d1ec66d6729565b9870a240006d37361 on master.

redshiftzero commented 7 years ago

My goal wasn't to remove the configuration section, but to get the provisioning sorted out so we can get the crawlers working again. Will this PR get the crawlers running and dumping data into the right database the way the VPSes are currently being provisioned? If yes, let's get this merged so we can continue crawling. Both of the above assumptions are true in the environments where we run these things (the Vagrant dev VM, Travis, and the VPSes), no?

psivesely commented 7 years ago

Although these assumptions may be correct at the moment, this approach requires that we never make a change that breaks them.

I think there are easier ways to get things re-running by EOD (as that is our stated goal) that rely on fewer assumptions. Namely, I still believe we should hard-code the test database credentials in TestDatabase and leave management of config.ini production database credentials and PGPASSFILE to the user after the initial configuration. It should take 5 minutes to write and run a custom tasklist to set the correct production values in the config.ini files of the VPSes, and those values should not need to be modified again. Making the permanent changes to our playbook ensures that these values are set correctly on first provision for Travis, Vagrant, and if/when we spin up new instances.

What do you think @conorsch?
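For reference, the password-file lookup under discussion can be sketched as follows. This is an illustration of the documented file format (`hostname:port:database:username:password`, `*` as a wildcard, backslash escapes, `PGPASSFILE` overriding the default path), not the project's code. The function names and sample entries are hypothetical. Clients built on libpq consult this file themselves when no password is supplied, so explicit parsing like this is only needed if the application wants the values directly.

```python
import os


def pgpass_path():
    """Return the password file location, honouring PGPASSFILE if it is set."""
    return os.environ.get("PGPASSFILE", os.path.expanduser("~/.pgpass"))


def split_fields(line):
    """Split one entry on unescaped colons, resolving backslash escapes."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def find_password(lines, host, port, database, user):
    """Return the password of the first matching entry, or None if none match."""
    wanted = (host, str(port), database, user)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = split_fields(line)
        if len(fields) != 5:
            continue  # skip malformed entries
        if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
            return fields[4]
    return None


# Hypothetical entries, for illustration only.
sample_lines = [
    "# host:port:database:user:password",
    "db.example:5432:crawls:alice:s3cret",
    "*:*:*:bob:p\\:w",
]
print(find_password(sample_lines, "db.example", 5432, "crawls", "alice"))
```

To read a real file, pass `open(pgpass_path())` as `lines`. The first matching entry wins, so more specific entries belong above wildcard ones.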

conorsch commented 7 years ago

Applied this feature branch to the prod crawlers. @redshiftzero Can you confirm working sorting and crawling? If so, I vote we merge.

> Namely, I still believe we should hard-code the test database credentials in TestDatabase and leave management of config.ini production database credentials and PGPASSFILE to the user after the initial configuration.

Open a separate issue for that, since this PR includes the test regression of snipping out the db tests to keep the merge-wheels turnin'.

> Should take 5 minutes to write and run a custom tasklist to set the correct production values in the config.ini files of the VPSs.

Sounds to me like the config.ini file should be converted to a full template, so we can set sane defaults (e.g. for use in Travis) and override them via vars in any other environment (i.e. prod).
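The defaults-plus-overrides idea can be sketched roughly as below. In the playbook this would presumably be a Jinja2 template rendered by Ansible. The Python here uses `string.Template` as a stand-in to show the shape of the approach, and the `[database]` section, option names, and default values are all hypothetical.

```python
from string import Template

# Hypothetical config.ini layout: section and option names are assumptions.
CONFIG_TEMPLATE = Template("""\
[database]
pghost = $pghost
pgport = $pgport
pgdatabase = $pgdatabase
pguser = $pguser
""")

# Defaults meant to work out of the box in a test environment.
DEFAULTS = {
    "pghost": "localhost",
    "pgport": "5432",
    "pgdatabase": "test_db",
    "pguser": "test_user",
}


def render_config(overrides=None):
    """Render config.ini text, letting per-environment values replace defaults."""
    values = {**DEFAULTS, **(overrides or {})}
    return CONFIG_TEMPLATE.substitute(values)


# Test environments use the defaults; another environment overrides only what differs.
print(render_config())
print(render_config({"pghost": "db.internal.example", "pgdatabase": "crawl_db"}))
```

With this shape, a fresh provision always produces a complete config.ini, and production values live in one place (the overrides) rather than being edited by hand after the fact.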

redshiftzero commented 7 years ago

Sorter is working on the VM and on the VPSes (just finished another sort), and database connections are now working everywhere, but I'm running into Tor issues when crawling on the VPSes. I say :+1: to merging this, since nothing in this PR changes anything Tor-related in the crawler (i.e. these issues are already in master).

conorsch commented 7 years ago

Thanks for confirming functionality, @redshiftzero! Going to merge as-is.

> I'm running into tor issues on crawling on the VPSes.

If you have a traceback, toss into a separate issue!

redshiftzero commented 7 years ago

yep will construct a coherent description of what is going on and make an issue, thanks @conorsch