freedomofpress / fingerprint-securedrop

A machine learning data analysis pipeline for analyzing website fingerprinting attacks and defenses.
GNU Affero General Public License v3.0

Crawler is running into terminal connection refused socket failures #4

Open psivesely opened 8 years ago

psivesely commented 8 years ago

Edit: see https://github.com/freedomofpress/FingerprintSecureDrop/issues/4#issuecomment-228825080 for a better explanation and the full traceback; this original report is incomplete and doesn't even include the full traceback.

So the crawler is for the most part working very well. Where it runs into problems is what seems to be a Python IO/socket exception (Errno 111). Once it hits this error, it will fail the rest of the way through the crawl pretty instantaneously. See the log at the bottom of this post.

I believe this is actually caused by a bug in Python3.5--see https://bugs.python.org/issue26402--but that warrants further testing. The PPA we've been using, https://launchpad.net/~fkrull/+archive/ubuntu/deadsnakes?field.series_filter=trusty, hasn't published an updated Python3.5 for Ubuntu 14.04 (trusty) since December. It's about our only choice for newer Python versions, and I'd already done the work to migrate this script to Python3.5 so that we could use a single virtual environment for both the HS sorting and crawling scripts. Since at this point in our research we don't really need to run the sorting script, I think I'll just break compatibility with it by changing the ansible roles to install and use Python3.3 instead; that should hopefully fix things.

♫ Truckin' ♫
...
06:51:26 http://maghreb2z2zua2up.onion: exception: Remote end closed connection without response
06:51:26 http://radiohoodxwsn4es.onion: loading...
06:51:26 http://radiohoodxwsn4es.onion: exception: [Errno 111] Connection refused
06:51:26 http://tqjftqibbwtm4wmg.onion: loading...
06:51:26 http://tqjftqibbwtm4wmg.onion: exception: [Errno 111] Connection refused
06:51:26 http://newstarhrtqt6ua7.onion: loading...
06:51:26 http://newstarhrtqt6ua7.onion: exception: [Errno 111] Connection refused
...
And so on (it fails through the rest of the URLs almost instantly).


psivesely commented 8 years ago

Testing https://github.com/freedomofpress/FingerprintSecureDrop/commit/5802bd36f84b4a51ab08a5ff6becaa25f2726f61 to address this.

psivesely commented 8 years ago

Crawls in progress. Will check on them tomorrow morning to see if they failed part-way through or not.

psivesely commented 8 years ago

@redshiftzero found that cubie3atuvex2gdw.onion, which redirects to https://another6nnp2ehkn.onion/ (self-signed cert), reproduces the error. I'm in the process of refactoring the crawler, but I have a couple more URLs for the "known to have crashed the crawler" list that I should add here soon. These might help in testing/debugging this problem. There's also been a good amount of discussion on FPF's Slack about this bug and plans to figure it out that I should copy over here.

psivesely commented 8 years ago

Copying my comments from external discussions about this:

Here's the breakdown of what happens: after establishing a connection to a peer on a socket bound to a local address, we send a well-formed GET request to that peer (an onion service). If the remote end closes the connection without sending a response (i.e., the first line we try to read is empty), then http.client.RemoteDisconnected is raised. My crawler catches this exception here. (I realized I need to add a continue statement to the end of that except block, and it would also be better to move the circuit cleanup code into the finally block. I don't think this is the cause of the problem, but I'm going to push a fix so we can see; either way, let me know what you think.) After this error happens, the crawler never seems to recover: every subsequent site fails the same way.
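A minimal sketch of that fix (hypothetical names: `driver`, `cleanup_circuits`, and `log` stand in for the real crawler's Selenium driver, circuit teardown, and logging; this is not the actual crawler code):

```python
import http.client

def crawl(urls, driver, cleanup_circuits, log):
    # Hypothetical crawl loop illustrating the fix described above:
    # `continue` after handling the exception, with circuit cleanup in
    # `finally` so it runs once per URL whether or not the load failed.
    for url in urls:
        try:
            driver.get(url)
        except http.client.RemoteDisconnected as exc:
            log("%s: exception: %s" % (url, exc))
            continue  # skip to the next URL instead of falling through
        finally:
            # `finally` runs even when the except block hits `continue`
            cleanup_circuits()
        log("%s: loaded" % url)
```

Note that a `continue` inside the except block still triggers the `finally` clause before moving to the next URL, so cleanup can't be skipped.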

What happens to the rest of the sites is as follows: a well-formed GET request is drafted and the socket.connect() method is called to try to connect to the remote onion service. However, socket.connect() raises ConnectionRefusedError, which corresponds to errno ECONNREFUSED (errno 111). The docs say this happens when a connection attempt is refused by the peer, but I don't think that's the case here, unless the logic that ensures the GET request is well-formed somehow gets corrupted by improper handling of the http.client.RemoteDisconnected exception in CPython. That doesn't seem likely, though, because a new instance of the HTTPConnection class is created for each connection... so I feel like I need to look at how sockets are implemented in CPython to figure out what's going wrong.
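For reference (a standalone demonstration, nothing to do with the crawler itself): errno 111 means the TCP connect itself was refused, which you can reproduce by connecting to a local port that nothing is listening on:

```python
import errno
import socket

# Ask the OS for a free port, then close the listener so nothing is
# accepting on that port anymore.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
port = listener.getsockname()[1]
listener.close()

refused = False
try:
    # The same call http.client ends up making via socket.create_connection()
    socket.create_connection(("127.0.0.1", port), timeout=5)
except ConnectionRefusedError as exc:
    refused = exc.errno == errno.ECONNREFUSED  # errno 111 on Linux
print(refused)
```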

# The crawler keeps crashing after it hits exception 1 a single time. Then it
# hits exception 2 for every single connection thereafter. I've broken down the
# tracebacks with comments. They both start off the same for the first 4 frames
# or so, where selenium works out the calls it's going to make to the standard
# Python libraries.
​
# ~*~*~*~* 1 *~*~*~*~
​
"./crawl_hidden_services.py", line 163, in crawl_class
    driver.get(url)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in get
    self.execute(Command.GET, {'url': url})
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
    response = self.command_executor.execute(driver_command, params)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
    return self._request(command_info[0], url, body=data)
​
# RemoteConnection._request() sends an HTTP request to the remote server.
# self.keep_alive is true, so we send that in our request. keep_alive being true
# also means we set self._conn = httplib.HTTPConnection(args) (line 188) in the
# __init__ of our RemoteConnection object instance.
​
# The request goes through okay, so self.__state should be _CS_REQ_SENT.
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 426, in _request
    resp = self._conn.getresponse()
​
# This is really calling httplib.HTTPConnection.getresponse(). (Note: debug
# option present here). response = HTTPResponse(self.sock, [self.debuglevel,]
# method=self._method). Then we call the begin() method on our HTTPResponse object.
​
"/usr/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
​
# self.headers is not None, so we continue. The first call we make is to
# self._read_status(), to try to read the first line, which should include the
# status information.
​
"/usr/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
​
# self._read_status() tries to read the first line of the response, but it is
# empty, so we assume that the remote end closed the connection without a
# response. It shouldn't be able to know we're a crawler because we're using Tor
# Browser...
​
"/usr/lib/python3.5/http/client.py", line 251, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
​
http.client.RemoteDisconnected: Remote end closed connection without response
​
# "A subclass of ConnectionResetError and BadStatusLine. Raised by
# HTTPConnection.getresponse() when the attempt to read the response results in no
# data read from the connection, indicating that the remote end has closed the
# connection."
​
# ~*~*~*~* 2 *~*~*~*~
​
"./crawl_hidden_services.py", line 163, in crawl_class
    driver.get(url)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in get
    self.execute(Command.GET, {'url': url})
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
    response = self.command_executor.execute(driver_command, params)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
    return self._request(command_info[0], url, body=data)
​
"/home/noah/FingerprintSecureDrop/lib/python3.5/site-packages/selenium/webdriver/remote/remote_connection.py", line 425, in _request
    self._conn.request(method, parsed_url.path, body, headers)
​
# httplib.HTTPConnection.request() wraps self._send_request()
​
"/usr/lib/python3.5/http/client.py", line 1083, in request
    self._send_request(method, url, body, headers)
​
# This method calls self.putrequest(), which may warrant further investigation
# (line 915), and then calls self.endheaders()
​
"/usr/lib/python3.5/http/client.py", line 1128, in _send_request
    self.endheaders(body)
​
# This method sends the buffered request to the server. The state moves from
# _CS_REQ_STARTED to _CS_REQ_SENT, and then self._send_output() is called
​
"/usr/lib/python3.5/http/client.py", line 1079, in endheaders
    self._send_output(message_body)
​
# This method calls self.send()
​
# You might also set self.debuglevel > 0 for more information, if it proves
# necessary. Then the data is read block by block and sent with
# self.sock.sendall(datablock).
​
"/usr/lib/python3.5/http/client.py", line 911, in _send_output
    self.send(msg)
​
# which first may auto_open a socket. To do so it calls self.connect()
​
"/usr/lib/python3.5/http/client.py", line 854, in send
    self.connect()
​
#  which sets self.sock = self._create_connection()
​
"/usr/lib/python3.5/http/client.py", line 826, in connect
    (self.host,self.port), self.timeout, self.source_address)
​
# which is defined as socket.create_connection in the class instance __init__
# block. socket.create_connection() connects to an address and returns the socket
# object. Note self.source_address is set on our HTTPConnection object, so
# socket.create_connection() will bind to it as the source address before making
# the connection. First socket.getaddrinfo() is called on the address, which is a
# tuple: (self.host, self.port). socket.getaddrinfo() translates the host/port
# args into a sequence of 5-tuples (family, type, proto, canonname, sockaddr)
# that contain all the necessary args for creating a socket connected to that
# service.
​
"/usr/lib/python3.5/socket.py", line 711, in create_connection
    raise err
​
# After successfully binding to the local address w/ socket.bind(),
# socket.create_connection() tries to connect to the peer with socket.connect()
​
"/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
​
# This method fails--the socket class is built-in, so we can't debug further.
​
ConnectionRefusedError: [Errno 111] Connection refused
​
# "A subclass of ConnectionError, raised when a connection attempt is refused by
# the peer. Corresponds to errno ECONNREFUSED." Since this is a builtin
# exception, we can't get much more info about it
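Exception 1 is likewise easy to reproduce in isolation (a standalone sketch, not crawler code): a server that reads the request and then closes the connection without writing a status line produces the same RemoteDisconnected:

```python
import http.client
import socket
import threading

# Throwaway server: accept one connection, read the request, then
# close without sending a single response byte.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

def close_without_response():
    conn, _ = listener.accept()
    conn.recv(65536)  # consume the GET request
    conn.close()      # FIN with no status line

t = threading.Thread(target=close_without_response)
t.start()

client = http.client.HTTPConnection("127.0.0.1", port)
client.request("GET", "/")
msg = ""
try:
    client.getresponse()  # _read_status() reads an empty first line
except http.client.RemoteDisconnected as exc:
    msg = str(exc)
t.join()
print(msg)  # Remote end closed connection without response
```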

URLs known to cause the problem: http://money2mxtcfcauot.onion and http://22222222aziwzse2.onion. (There were more, but I neglected to save them.)

psivesely commented 8 years ago

One idea is basically to restart Tor and Tor Browser when this happens. It's a hack, but it isn't my fault that one can't simply catch this error and continue, and finding/resolving it upstream has proved quite difficult. I'm in the process of implementing that for the refactored crawl_onions.py.
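A sketch of that workaround (all names hypothetical: `crawl_url` and `restart_tor_and_browser` stand in for whatever the refactored crawler uses to fetch a page and to tear down/relaunch Tor and Tor Browser): treat the unrecoverable errors as a signal to restart, then retry the same URL on the fresh instance:

```python
import http.client

def crawl_with_restarts(urls, crawl_url, restart_tor_and_browser,
                        max_restarts=3):
    # On the errors that wedge the crawler, relaunch Tor/Tor Browser
    # and retry the current URL rather than failing the rest of the crawl.
    restarts = 0
    for url in urls:
        while True:
            try:
                crawl_url(url)
            except (ConnectionRefusedError,
                    http.client.RemoteDisconnected):
                if restarts >= max_restarts:
                    raise  # give up: restarting isn't helping
                restarts += 1
                restart_tor_and_browser()
                continue  # retry this URL on the fresh instance
            break  # success; move on to the next URL
    return restarts
```

Capping the number of restarts keeps a genuinely dead onion service from putting the crawler into an endless restart loop.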