Database integration - Githubissues

redshiftzero commented 8 years ago

Right now we're generating a lot of data that gets stored across many small files. This data situation is quickly going to become a mess, so we should get more organized by having our data collection code - the sorter/crawler - automatically upload its measurements into the relevant tables in a database each time it runs. Given the amount of data we have, PostgreSQL should suffice. I propose we have a separate schema raw that will store the raw training examples. Features derived from these raw measurements can be stored in a separate schema features, and results from our classifier experiments should be uploaded into another schema ml. Here's a proposed initial design for this first schema raw for the measurement task we are focused on currently, collecting data from HS frontpages:

entity

The table frontpage_examples contains a row for every measurement of a given HS that we take with primary key exampleid. It links to the crawlers table with primary key crawlerid which describes information about the measurement conditions. The raw cell traces will be inserted into frontpage_traces and link back to frontpage_examples via exampleid. This structure enables us to very quickly select train/test sets in SQL with a couple of simple joins based on attributes we might be interested in: timestamp, url, crawler AS, sd_version, and so on.

psivesely commented 8 years ago

I think that some of this may be overcomplicated for what we're trying to achieve and does not follow https://blog.torproject.org/blog/ethical-tor-research-guidelines. Here's what the plan been so far (certainly up for discussion):

Collect traces (time, direction pairs) from every onion service website we can find and reach
Collect a bunch of traces from each up-to-date SD (our monitored class) and collect a dozen or so traces from every hidden services that is certainly not SD (our non-monitored class). All that needs to be known is which class a trace belongs to, the URL and all other info should never be saved.
Get a baseline for classifier accuracy
Implement various defenses for SD on VPSs we spin up
Collect data on these new defenses
Evaluate which are most effective

This process has been kept simple, not just to streamline it, but because I don't believe we need or should be collecting data such as (following through your arrows): cellid, circuit, stream, command, length (length is determined by command anyway because of padding, so is redundant); hs_url; or ip.

That said, I'm not totally against the idea of using a database to keep track of this stuff. In fact, I definitely see how this could be really useful, especially in a larger or more complicated project. I just don't want to overcomplicate the project and I barely know the first thing about databases myself. So you'd have to be in charge of implementing this (I know @conorsch mentioned he would be willing to help). Maybe you can come up with a revised designed, or we could discuss this in more detail via a call sometime this week?

redshiftzero commented 8 years ago

My understanding of the ethical guidelines are that one should not run a HSDir and harvest and republish onion addresses - which we of course we won't do - but that republishing say, an onion address found on a public listing like ahmia.fi would be considered ethical. With this in mind, we're restricting our project to only onion addresses found on public websites, taking measurements from our own traffic connecting to these addresses, and then saving these measurements into a database. Even if we were to release all the data I think it would be consistent with the ethical guidelines - let me know if there's something I'm missing here.

In fact, all the various fields in the frontpage_* tables (with the exception of the keys cellid, exampleid which are just there to ensure that each record in the table is unique) are already being saved by the data collection codes - including hs_url - they're just saved in a bunch of pickle and text files. I'm just advocating to organize this data into a database. However, if you think that some of these fields are definitely of no use then I'm happy to drop the field. Also I'm down to be in charge of implementing this, but let's discuss this further on a call :+1:

psivesely commented 8 years ago

My last post was a little hastily written as I was running late when writing it and wanted to send it out then. I'll try to maybe be a little more clear now just to so you have a better sense for the meeting tomorrow, where we can go over this all in more detail and work out a solid plan.

My intention was to emphasis three things mainly. First, that I want to play it on the safe side of data collection minimization practices. Second, that this problem ultimately needs to be solved on the protocol level--something that is being worked on--and that keeping ourselves pretty narrowly focused on as soon as possible, proving significant effectiveness of any practical server-side defense mechanism for SD is important--our team has limited person power and competing projects that I'd like to move more attention to in the coming months (not that this isn't very important). And lastly, that I have no next to no background with databases and am also leading this project, which makes me nervous might become a problem because I don't have very much free time to learn a new thing.

Again, I actually think this is a good issue to raise, and that the database will be worth it's investment; I just wanted to share some concerns I had regarding it. Excited to learn from you on this :bread:

psivesely commented 8 years ago

@redshiftzero and I had a good call about this and decided that database integration should ideally be built into the crawler, feature extractor, and classifier. I've been convinced that we can relax our data minimization practices a bit since this is our own traffic and since we will take steps (still debatable which exact steps, but that can be figured out later) to anonymize and minimize the data we do release. The one thing we really don't want to do is provide ready-to-go data and tools for a real-world bad actor. This increased scope of data collection will not only allow us greater introspection into the problem we face (and will help us answer many questions we might ask along the way), but is also useful (i) if we realize down the road that we weren't collecting some data (e.g., non-DATA tor cells) and (ii) for debugging.

She's done some minor redesign of the schema (see below) and we talked about how to rewrite some parts of the crawler with this in mind. The problems I still see with this schema are:

LENGTH is probably not helpful since Tor cells are padded to a fixed 512 byte size.
sd_version is not helpful since we can't implement defenses for people who don't upgrade--we're only interested in how the latest version does and what modifications we can make to it.
Crawlerid should maybe be the partial hash of the last commit that applied to the crawler code or there should be a different field for that if crawlerid is a necessary serializing field (remind me what pk and fk stand for?)
Country should be city--right now we have both SF and NYC in the US :fire: :us: :fire:--pretty far apart locations--also London :fire: :gb: :fire:

2016-07-01-170905_787x936_scrot

The crawler will be doing less processing of data--in fact the only reason it needs to do processing on a raw trace (meaning unmodified tor cell log from the time period that a trace was loading) is for future support of parallelization (see #9). To explain, if you have a raw trace you can see which rendezvous circuit(s) present in that trace are actually created during that time period of the trace in order to associate the correct ones with the URL being crawled. With parallelization, this wouldn't be possible, and we'd need additional data from stem in order to correlate rendezvous circuits to URLs--no problem. Since we're pretty much exclusively interested in the rendezvous circuit data anyway--the full trace is just for deeper introspection into tor process, debugging, etc.--we'll also create a -rc (rendezvous circuits) for each raw trace that contains just the rendezvous circuit cells associated with the URL we are trying to capture a trace from.

With that in mind here is roughly the plan for implementing database support:

[x] Fixup a few minor bugs in the refactor
[x] Change the crawler to spit out -raw and -rc traces that contain the full fields from the tor cell log--no more time normalization, processing, filtering, etc. will be necessary
[ ] Modify https://github.com/redshiftzero/fun-with-fingerprinting/blob/master/exploration/raw_util.py to implement the full schema and make sure things are working with these -raw and -rc traces the crawler is creating
[ ] Build the database code into the crawler, so that we can effortlessly build independent databses using the same schema that can then be seamlessly merged (this should be totes possible, right?)
[ ] Finish my implementation of Wa-kNN/ heavy re-factor and modification of @pylls code in https://github.com/fowlslegs/go-knn
[ ] Add in support to the feature extractor for the database using https://godoc.org/github.com/lib/pq
[ ] Add in support to the classifier for the database (again using pq)

If you see anything that needs to be added or corrected or you just want to follow up on the questions/ points I raised, etc., above do feel encouraged to respond @redshiftzero . We can talk about splitting up the work on other channels.

psivesely commented 8 years ago

Also, we should probably break that checklist up into individual separate issues that reference this one.

psivesely commented 8 years ago

Tor Browser version should be in there too.

psivesely commented 8 years ago

sd_version will in fact be useful as it will probably change through the course of our initial research and as we intend to use the framework we're developing as we move forward. My bad.

pk = private key fk = foreign key

Going to dig into the pandas documentation in the near future.

psivesely commented 8 years ago

wait_on_page and page_load_timeout should be captured in the "control file" (that's what I'm calling it) as well.

redshiftzero commented 8 years ago

(pk = primary key)

psivesely commented 8 years ago

Closing in favor of #25, #26, and https://github.com/fowlslegs/go-knn/issues/1.

freedomofpress / fingerprint-securedrop

Database integration #16