Commit message
Logging the last N responses (just their successful/failed state) to be able to calculate the error rate, which gives us a much better metric for deciding whether we need to delay our requests. Still, the issue seems to be that the code at _pause_on_http_error isn't called as often as needed, which in turn allows the error rate to increase and _should_stop_scan to return True.

This might be because of many threads hitting the xurllib at the same time. At any point in time 20 workers will send requests "simultaneously"; if all of them fail (which might happen if the site goes down for more than socket_timeout) then we'll get 20 failed responses "at once", which might trigger this effect of _pause_on_http_error not being called.
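A minimal sketch of the idea, assuming a window of 50 responses and simple SUCCESS/FAILED markers (names are illustrative, not the exact w3af implementation):

from collections import deque

SUCCESS = 'SUCCESS'
FAILED = 'FAILED'

class ResponseLog(object):
    """Keep only the last N response states and derive an error rate."""

    def __init__(self, window_size=50):
        # A deque with maxlen silently drops the oldest entry on append
        self._last_responses = deque(maxlen=window_size)

    def log_response(self, success):
        self._last_responses.append((success, SUCCESS if success else FAILED))

    def get_error_rate(self):
        # Percentage of failed responses in the current window
        if not self._last_responses:
            return 0.0
        failed = len([s for s, _ in self._last_responses if not s])
        return failed * 100.0 / len(self._last_responses)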
See log at https://gist.github.com/andresriancho/5195c96dbed57c5c3a26
While the pcap seemed to indicate that it was a server issue (w3af did send the request but got no answer), another test proves that the server is actually alive:
import time
import requests

while 1:
    try:
        time.sleep(0.1)
    except KeyboardInterrupt:
        break

    try:
        response = requests.get('http://10.5.6.33/dvwa/login.php')
    except KeyboardInterrupt:
        break
    except Exception, e:
        print 'Offline ("%s")' % e
    else:
        print 'Online (%s)' % response.elapsed.total_seconds()
Running https://gist.github.com/andresriancho/02bbec9e734e0423adac in one console and w3af_gui -p test_error in another shows that the server is still accessible from the onliner.py script. This makes me think that w3af might be re-using connections which are already dead/closed on the server side:
[keepalive] Failed to re-use connection 773d7bcec9f6508d to 10.5.6.34 due to exception "timed out"
[keepalive] Removed connection 773d7bcec9f6508d, reason replace connection, 10.5.6.34 pool size is 27
[keepalive] Failed to re-use connection b6e00802f2a21cde to 10.5.6.34 due to exception "timed out"
[keepalive] Replaced bad connection 773d7bcec9f6508d with the new 72e0a93649d2db03
...
[keepalive] Removed connection 773d7bcec9f6508d, reason socket timeout, 10.5.6.34 pool size is 28
That doesn't look good... removed the same connection twice?
Two options:
Looks like it's the "I'm having a race condition and returning the same connection more than once before putting it into the used state" case:
[keepalive] Connection 0d06bc4f674d0173 was NOT removed from hostmap pool.
[keepalive] Connection 0d06bc4f674d0173 was NOT in free/used connection lists.
[keepalive] Removed connection 0d06bc4f674d0173, reason socket timeout, 10.5.6.34 pool size is 30
Fixed the "was NOT" messages; it was a naming bug in conn = self._cm.replace_connection(conn, host, conn_factory)
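For context, a sketch of the kind of race being described: without one lock covering the whole check-and-move, two threads can be handed the same free connection before either marks it as used (a hypothetical manager, not the actual keepalive handler code):

import threading

class ConnectionManager(object):
    def __init__(self):
        self._lock = threading.RLock()
        self._free = []
        self._used = []

    def get_available_connection(self, host, conn_factory):
        with self._lock:
            # Pop-and-mark-used must be atomic: if two threads pass an
            # unlocked "is there a free connection?" check before either
            # pops, both end up holding the same connection object
            if self._free:
                conn = self._free.pop()
            else:
                conn = conn_factory(host)
            self._used.append(conn)
            return conn

    def free_connection(self, conn):
        with self._lock:
            if conn in self._used:
                self._used.remove(conn)
                self._free.append(conn)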
Is a 500 forcing the keep-alive connection to die?
No.
The requests which change the page parameter:
GET /dvwa/vulnerabilities/fi/?page=htTps%3A%2F%2Fw3af.org%2F HTTP/1.1
Connection: keep-alive
Host: 10.5.6.35
Referer: http://10.5.6.35/
Accept-encoding: gzip, deflate
Cookie: security=low; PHPSESSID=8fid7r3o46bqc9b97261qrqd32
Accept: */*
User-agent: w3af.org
Are failing with socket timeout because the OWASP VM doesn't have "good" internet access:
root@owaspbwa:/var/www/dvwa/vulnerabilities/fi# curl https://www.w3af.org/
... waiting some minutes ...
The onliner.py tool worked well because the Apache server is still alive and working without any issues; the PHP code at fi/ is what's introducing the delay.
That's the reason for the delay in this request. Can be reproduced with:
$ nc 10.5.6.35 80 -v -v
Connection to 10.5.6.35 80 port [tcp/http] succeeded!
GET /dvwa/vulnerabilities/fi/?page=htTps%3A%2F%2Fw3af.org%2F HTTP/1.1
Connection: keep-alive
Host: 10.5.6.35
Referer: http://10.5.6.35/
Cookie: security=low; PHPSESSID=8fid7r3o46bqc9b97261qrqd32
Accept: */*
User-agent: w3af.org
... wait ...
HTTP/1.1 200 OK
Date: Fri, 06 Mar 2015 19:52:04 GMT
Server: Apache/2.2.14 (Ubuntu) mod_mono/2.4.3 PHP/5.3.2-1ubuntu4.5 with Suhosin-Patch proxy_html/3.0.1 mod_python/3.3.1 Python/2.6.5 mod_ssl/2.2.14 OpenSSL/0.9.8k Phusion_Passenger/3.0.17 mod_perl/2.0.4 Perl/v5.10.1
Just make sure you've got the right cookie or a redirect will be returned.
The solution to this issue is to check if the root domain path is still accessible when MAX_ERROR_COUNT - 1 is reached (the default value of 10 - 1 is ok). If the root path is accessible, then I should add a (True, SUCCESS) to the _last_responses and continue. This guarantees that MAX_ERROR_COUNT is never reached while the root path is still accessible.
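A sketch of that check, using requests for brevity where w3af would use its own xurllib (the function and parameter names are assumptions):

import requests

MAX_ERROR_COUNT = 10
SUCCESS = 'SUCCESS'

def check_root_before_stopping(last_responses, consecutive_errors, root_url):
    """One error away from MAX_ERROR_COUNT? Verify the server root first.

    Returns the updated consecutive error count.
    """
    if consecutive_errors < MAX_ERROR_COUNT - 1:
        return consecutive_errors

    try:
        requests.get(root_url, timeout=5)
    except Exception:
        # The root path is down too; let the error counter keep growing
        # so the scan-must-stop logic can eventually kick in
        return consecutive_errors

    # The root path answered: only specific URLs are timing out, so log
    # a synthetic success and keep scanning
    last_responses.append((True, SUCCESS))
    return 0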
The solution above covers the case where one (or more) URLs are timing out but the application is still accessible and running. The cases that will still trigger a scan-must-stop exception are:
Something that I noticed while debugging/fixing this issue is that the default timeout is too high (15 seconds). This high value was mostly set due to multiple errors being generated by the library in the past, and me being unable to fix/debug/analyze them.
To fix the 15 second timeout I propose a new feature where the timeout auto-adjusts (slides) based on the response times observed during the scan.
With this sliding timeout we'll also get the nosetests w3af/core/data/url/tests/test_xurllib_error_handling.py -s test running faster, since the URLs will time out faster.
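A sketch of one way the sliding timeout could work, deriving the socket timeout from recently observed response times (the multiplier and bounds are assumptions):

TIMEOUT_MULT = 6.5    # assumed: timeout as a multiple of the average RTT
MIN_TIMEOUT = 2.0     # assumed floor, in seconds
MAX_TIMEOUT = 15.0    # today's fixed default becomes the ceiling

def adjust_timeout(recent_response_times):
    """Return a socket timeout based on how fast the server answers."""
    if not recent_response_times:
        return MAX_TIMEOUT
    average_rtt = sum(recent_response_times) / len(recent_response_times)
    return max(MIN_TIMEOUT, min(average_rtt * TIMEOUT_MULT, MAX_TIMEOUT))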
One more thing to note: it might be a good idea to stop showing tracebacks for URLTimeoutError, since they are pretty common:
Traceback (most recent call last):
File "/home/pablo/pch/w3af/w3af/core/data/url/extended_urllib.py", line 578, in _send
res = self._opener.open(req)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/home/pablo/pch/w3af/w3af/core/data/url/handlers/keepalive/__init__.py", line 328, in http_open
return self.do_open(req)
File "/home/pablo/pch/w3af/w3af/core/data/url/handlers/keepalive/__init__.py", line 162, in do_open
raise URLTimeoutError()
URLTimeoutError: HTTP timeout error.
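A sketch of the change: catch URLTimeoutError around the send and log a single line instead of letting the traceback print (the logging call is illustrative; w3af would use its own output manager):

import logging

class URLTimeoutError(Exception):
    """HTTP timeout error."""

def send_with_quiet_timeouts(opener, req):
    try:
        return opener.open(req)
    except URLTimeoutError:
        # Timeouts are expected and frequent: one debug line, no traceback
        logging.debug('HTTP timeout for %s' % req.get_full_url())
        raise
    except Exception:
        # Everything else is unexpected and keeps the full traceback
        logging.exception('Unhandled error for %s' % req.get_full_url())
        raise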
ignore_errors might need renaming; I might also want to set it to True when strategy.py checks if the site is up.

S, S, S, ... S (50), E, S, S, S, ... S (50), E, E

This would happen since _total_requests only increases on successful requests. In _server_root_path_is_reachable, because I never find the True, False, ..., False pattern, I never raise the must-stop exception; maybe add self._last_responses.append((True, SUCCESS))?
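For reference, a sketch of the pattern check being described: one success immediately followed by a run of failures (the exact matching logic in w3af may differ):

def found_stop_pattern(last_responses, error_count=10):
    """True if one success is immediately followed by error_count failures."""
    flags = [success for success, _ in last_responses]
    pattern = [True] + [False] * error_count
    for i in range(len(flags) - len(pattern) + 1):
        if flags[i:i + len(pattern)] == pattern:
            return True
    return False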
Based on https://circleci.com/gh/andresriancho/w3af/1557 I'm sure that the timeout auto-adjust is generating all those randomly failing tests. I still need to debug why they fail and figure out how the feature needs to be fixed.
How to reproduce
This was found while trying to reproduce https://github.com/andresriancho/w3af/issues/4219; I'm able to reproduce the issue in every scan.
In the logs:
In the pcap file (note the time column, where we can clearly see that the remote end isn't answering the request, and then we send the FIN packet) (tcp.stream eq 87):

The feature I added to prevent errors like this isn't working as expected: it escalates quickly from waiting 0.15 seconds to 1.5, which is not what I wanted.
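A sketch of why the wait can escalate that fast, assuming the pause is a linear function of the error rate (the formula and constant are guesses that happen to reproduce the 0.15 to 1.5 jump):

ERROR_DELAY_MAX = 3.0  # assumed maximum pause, in seconds

def get_pause_time(error_rate):
    # error_rate is a percentage: 5% -> 0.15s, but a burst of failures
    # pushing the rate to 50% multiplies the pause tenfold, to 1.5s
    return ERROR_DELAY_MAX * error_rate / 100.0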