I think that some of this may be overcomplicated for what we're trying to achieve and does not follow https://blog.torproject.org/blog/ethical-tor-research-guidelines. Here's what the plan has been so far (certainly up for discussion): collect traces (`time`, `direction` pairs) from every onion service website we can find and reach.

This process has been kept simple, not just to streamline it, but because I don't believe we need or should be collecting data such as (following through your arrows): `cellid`, `circuit`, `stream`, `command`, `length` (`length` is determined by `command` anyway because of padding, so is redundant); `hs_url`; or `ip`.
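For concreteness, here is a minimal sketch of what a trace restricted to `time`, `direction` pairs might look like. This is just an illustration of the data shape, not the actual crawler code, and all values are made up:

```python
# A trace under the minimal collection plan: one (time, direction) pair per cell.
# Times here are seconds relative to the start of the page load; direction is
# +1 for outgoing cells and -1 for incoming cells.
from typing import List, Tuple

Trace = List[Tuple[float, int]]  # (relative_time_seconds, direction)

example_trace: Trace = [
    (0.000, +1),   # request heads out
    (0.412, -1),   # response cells start coming back
    (0.414, -1),
    (0.415, -1),
    (0.987, +1),
]
```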
That said, I'm not totally against the idea of using a database to keep track of this stuff. In fact, I definitely see how this could be really useful, especially in a larger or more complicated project. I just don't want to overcomplicate the project, and I barely know the first thing about databases myself. So you'd have to be in charge of implementing this (I know @conorsch mentioned he would be willing to help). Maybe you can come up with a revised design, or we could discuss this in more detail via a call sometime this week?
My understanding of the ethical guidelines is that one should not run an HSDir to harvest and republish onion addresses - which of course we won't do - but that republishing, say, an onion address found on a public listing like ahmia.fi would be considered ethical. With this in mind, we're restricting our project to only onion addresses found on public websites, taking measurements from our own traffic connecting to these addresses, and then saving these measurements into a database. Even if we were to release all the data, I think it would be consistent with the ethical guidelines - let me know if there's something I'm missing here.
In fact, all the various fields in the `frontpage_*` tables (with the exception of the keys `cellid` and `exampleid`, which are just there to ensure that each record in the table is unique) are already being saved by the data collection code - including `hs_url` - they're just saved in a bunch of pickle and text files. I'm just advocating to organize this data into a database. However, if you think that some of these fields are definitely of no use, then I'm happy to drop them. Also, I'm down to be in charge of implementing this, but let's discuss this further on a call :+1:
My last post was a little hastily written, as I was running late and wanted to send it out right away. I'll try to be a little more clear now so you have a better sense going into the meeting tomorrow, where we can go over this all in more detail and work out a solid plan.
My intention was mainly to emphasize three things. First, that I want to err on the safe side of data collection minimization practices. Second, that this problem ultimately needs to be solved at the protocol level--something that is being worked on--and that it's important we keep ourselves narrowly focused on proving, as soon as possible, the effectiveness of a practical server-side defense mechanism for SD--our team has limited person power and competing projects that I'd like to move more attention to in the coming months (not that this isn't very important). And lastly, that I have next to no background with databases and am also leading this project, which I'm nervous might become a problem because I don't have very much free time to learn a new thing.
Again, I actually think this is a good issue to raise, and that the database will be worth its investment; I just wanted to share some concerns I had regarding it. Excited to learn from you on this :bread:
@redshiftzero and I had a good call about this and decided that database integration should ideally be built into the crawler, feature extractor, and classifier. I've been convinced that we can relax our data minimization practices a bit since this is our own traffic and since we will take steps (still debatable which exact steps, but that can be figured out later) to anonymize and minimize the data we do release. The one thing we really don't want to do is provide ready-to-go data and tools for a real-world bad actor. This increased scope of data collection will not only allow us greater introspection into the problem we face (and will help us answer many questions we might ask along the way), but is also useful (i) if we realize down the road that we weren't collecting some data (e.g., non-DATA tor cells) and (ii) for debugging.
She's done some minor redesign of the schema (see below) and we talked about how to rewrite some parts of the crawler with this in mind. The problems I still see with this schema are:
The crawler will be doing less processing of data--in fact, the only reason it needs to do any processing on a raw trace (meaning an unmodified tor cell log from the time period during which a page was loading) is for future support of parallelization (see #9). To explain: if you have a raw trace, you can see which rendezvous circuit(s) present in that trace were actually created during the time period of the trace, in order to associate the correct ones with the URL being crawled. With parallelization, this wouldn't be possible, and we'd need additional data from `stem` in order to correlate rendezvous circuits to URLs--no problem. Since we're pretty much exclusively interested in the rendezvous circuit data anyway--the full trace is just for deeper introspection into the tor process, debugging, etc.--we'll also create a `-rc` (rendezvous circuits) trace for each raw trace that contains just the rendezvous circuit cells associated with the URL we are trying to capture a trace from.
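To make the `-raw` vs. `-rc` distinction concrete, here is a minimal sketch of the kind of filtering step I have in mind. The field names (`circuit`, `command`, `time`) and the log format are assumptions for illustration, not the crawler's actual format:

```python
# Sketch: derive a -rc trace from a -raw trace by keeping only the cells that
# belong to the rendezvous circuit(s) associated with the crawled URL.
# `raw_trace` is assumed to be a list of dicts parsed from the tor cell log;
# `rendezvous_circuit_ids` would come from the crawler (e.g., via stem).
from typing import Dict, Iterable, List

Cell = Dict[str, object]  # e.g. {"time": 0.412, "circuit": 7, "command": "RELAY", ...}

def extract_rc_trace(raw_trace: List[Cell],
                     rendezvous_circuit_ids: Iterable[int]) -> List[Cell]:
    """Return only the cells on the given rendezvous circuits, preserving order."""
    wanted = set(rendezvous_circuit_ids)
    return [cell for cell in raw_trace if cell.get("circuit") in wanted]
```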
With that in mind, here is roughly the plan for implementing database support:

- [ ] Have the crawler write `-raw` and `-rc` traces that contain the full fields from the tor cell log--no more time normalization, processing, filtering, etc. will be necessary
- [ ] Add database tables to store the `-raw` and `-rc` traces the crawler is creating
- [ ] Write the code that gets these traces into the database (`pq`) -- see the sketch below

If you see anything that needs to be added or corrected, or you just want to follow up on the questions/points I raised above, do feel encouraged to respond, @redshiftzero. We can talk about splitting up the work on other channels.
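On that last item: `pq` (presumably github.com/lib/pq, the Go Postgres driver) would be one option. Purely as a sketch of what the upload step could look like, here is the same idea in Python with `psycopg2`; the table and column names follow the draft `raw` schema described in this issue, and the ones the issue doesn't spell out (`t_scrape`, `cell_time`, `direction`) are placeholders:

```python
# Sketch: insert one example and its -rc trace into the draft raw.* tables.
# Names not given in the issue (t_scrape, cell_time, direction) are placeholders.
import psycopg2

def insert_example(conn, crawlerid, hs_url, t_scrape, rc_trace):
    """rc_trace: iterable of (cell_time, direction, circuit, stream, command, length) rows."""
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(
            """INSERT INTO raw.frontpage_examples (crawlerid, hs_url, t_scrape)
               VALUES (%s, %s, %s) RETURNING exampleid""",
            (crawlerid, hs_url, t_scrape),
        )
        exampleid = cur.fetchone()[0]
        cur.executemany(
            """INSERT INTO raw.frontpage_traces
               (exampleid, cell_time, direction, circuit, stream, command, length)
               VALUES (%s, %s, %s, %s, %s, %s, %s)""",
            [(exampleid, *row) for row in rc_trace],
        )
    return exampleid
```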
Also, we should probably break that checklist up into separate issues that reference this one.
Tor Browser version should be in there too.
`sd_version` will in fact be useful, as it will probably change over the course of our initial research, and we intend to keep using the framework we're developing as we move forward. My bad.
pk = primary key, fk = foreign key
Going to dig into the pandas documentation in the near future.
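Once the schema exists, pulling data into pandas should be straightforward. A rough sketch, assuming the draft `raw` schema from this issue and a local Postgres instance (the connection string and column names like `cell_time`/`direction` are placeholders):

```python
# Sketch: load -rc traces joined with their example metadata into a DataFrame,
# then group by example for feature extraction.
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=fpdb user=crawler")  # placeholder connection details

df = pd.read_sql(
    """
    SELECT e.exampleid, e.hs_url, t.cell_time, t.direction
    FROM raw.frontpage_traces t
    JOIN raw.frontpage_examples e USING (exampleid)
    ORDER BY e.exampleid, t.cell_time
    """,
    conn,
)

# One array of (cell_time, direction) rows per example, ready for featurization.
traces = {
    eid: grp[["cell_time", "direction"]].to_numpy()
    for eid, grp in df.groupby("exampleid")
}
```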
`wait_on_page` and `page_load_timeout` should be captured in the "control file" (that's what I'm calling it) as well.
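We haven't pinned down a format for that control file; just to make the idea concrete, here is one possible shape for it (JSON written from Python). Only `wait_on_page` and `page_load_timeout` come from the discussion above; every other field and all values are placeholders:

```python
# Sketch of a possible "control file" recording crawl parameters for a run.
import json

control = {
    "crawlerid": 1,
    "tb_version": "placeholder",
    "sd_version": "placeholder",
    "wait_on_page": 5,          # seconds to linger after a page finishes loading
    "page_load_timeout": 60,    # seconds before giving up on a page load
}

with open("control.json", "w") as f:
    json.dump(control, f, indent=2)
```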
(pk = primary key)
Closing in favor of #25, #26, and https://github.com/fowlslegs/go-knn/issues/1.
Right now we're generating a lot of data that gets stored across many small files. This data situation is quickly going to become a mess, so we should get more organized by having our data collection code - the sorter/crawler - automatically upload its measurements into the relevant tables in a database each time it runs. Given the amount of data we have, PostgreSQL should suffice. I propose we have a separate schema `raw` that will store the raw training examples. Features derived from these raw measurements can be stored in a separate schema `features`, and results from our classifier experiments should be uploaded into another schema `ml`. Here's a proposed initial design for this first schema `raw` for the measurement task we are focused on currently, collecting data from HS frontpages:

The table `frontpage_examples` contains a row for every measurement of a given HS that we take, with primary key `exampleid`. It links to the `crawlers` table with primary key `crawlerid`, which describes information about the measurement conditions. The raw cell traces will be inserted into `frontpage_traces` and link back to `frontpage_examples` via `exampleid`. This structure enables us to very quickly select train/test sets in SQL with a couple of simple joins based on attributes we might be interested in: timestamp, url, crawler AS, `sd_version`, and so on.
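For discussion, here is the proposal rendered as executable DDL (run here via psycopg2, though any client would do). This is a sketch of the design as described, not a final schema; column names the text doesn't spell out (`t_scrape`, `cell_time`, `direction`, `crawler_as`, `tb_version`) are my guesses:

```python
# Sketch of the proposed raw schema as psycopg2-executable DDL.
# Column names not spelled out in the issue are placeholders.
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS raw;

-- One row per crawler configuration / set of measurement conditions.
CREATE TABLE IF NOT EXISTS raw.crawlers (
    crawlerid   serial PRIMARY KEY,
    crawler_as  text,        -- autonomous system the crawler ran from
    sd_version  text,        -- SecureDrop version under test
    tb_version  text         -- Tor Browser version (per the comment above)
);

-- One row per measurement (example) of a given HS frontpage.
CREATE TABLE IF NOT EXISTS raw.frontpage_examples (
    exampleid   serial PRIMARY KEY,
    crawlerid   integer REFERENCES raw.crawlers (crawlerid),
    hs_url      text,
    t_scrape    timestamptz
);

-- One row per cell in the trace for an example.
CREATE TABLE IF NOT EXISTS raw.frontpage_traces (
    cellid      serial PRIMARY KEY,
    exampleid   integer REFERENCES raw.frontpage_examples (exampleid),
    cell_time   double precision,  -- seconds relative to the start of the example
    direction   smallint,          -- +1 outgoing, -1 incoming
    circuit     bigint,
    stream      bigint,
    command     text,
    length      integer
);
"""

# The kind of "couple of simple joins" train/test selection mentioned above.
SELECT_TRACES_FOR_SD_VERSION = """
SELECT t.exampleid, t.cell_time, t.direction
FROM raw.frontpage_traces t
JOIN raw.frontpage_examples e USING (exampleid)
JOIN raw.crawlers c USING (crawlerid)
WHERE c.sd_version = %s
ORDER BY t.exampleid, t.cell_time;
"""

def create_raw_schema(conn) -> None:
    """Create the raw schema and tables if they don't already exist."""
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
```

The final query is the sort of join-based selection described above; filtering on `crawler_as`, `hs_url`, or a time window would work the same way.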