andresriancho / w3af

w3af: web application attack and audit framework, the open source web vulnerability scanner.
http://w3af.org/

Javascript crawler #1796

Open andresriancho opened 10 years ago

andresriancho commented 10 years ago

User story

As a user I would like to be able to scan sites which are heavily based on JavaScript.

Research

Architecture and implementation plan

Javascript crawler - Architecture and Implementation plan

Conditions of satisfaction

Potential browsers to use

One of the most important things to take into account when choosing a JS engine is how easy it will be for the users to install it on their workstations.

I've finally decided to use phantomjs, which seems to be the most supported, widely used and stable project out there.

PhantomJS add-ons

This module correctly handles pages which dynamically load content by making AJAX requests. Instead of waiting a fixed amount of time before rendering, we give the page a short window to make additional requests.

We should use it too. I'm sure it's something simple, but at least it's one less thing we need to think about.
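
The same idea sketched from Python (a rough sketch, not tied to any particular add-on; it assumes a Selenium webdriver handle and a browser that exposes the Resource Timing API, and the function/parameter names are made up for illustration):

import time

def wait_for_quiet_network(driver, quiet_period=0.5, timeout=10.0):
    """Wait until the page stops issuing new requests.

    Instead of sleeping for a fixed amount of time, poll the browser's
    Resource Timing API and consider the page rendered once no new
    resources have been fetched for `quiet_period` seconds.
    """
    deadline = time.time() + timeout
    last_count = -1
    quiet_since = time.time()

    while time.time() < deadline:
        count = driver.execute_script(
            "return window.performance.getEntriesByType('resource').length;")
        if count != last_count:
            # A new request was observed, reset the quiet timer
            last_count = count
            quiet_since = time.time()
        elif time.time() - quiet_since >= quiet_period:
            return True
        time.sleep(0.1)

    # Give up after `timeout` seconds, the page never went quiet
    return False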

oxdef commented 10 years ago

My slides from J4M 2012: http://yadi.sk/d/7VBHg0n9LXxsz https://svn.code.sf.net/p/w3af/code/branches/webapps/

andresriancho commented 10 years ago

Thanks for the slides! Will be helpful for this task.

Any overall recommendation?

https://github.com/andresriancho/w3af/commits/webapps is the same as https://svn.code.sf.net/p/w3af/code/branches/webapps/

andresriancho commented 10 years ago

Maybe this "Tainted Phantomjs (TPJS) is the scriptable tool for DOM-based XSS detection. It is built based on the open source PhantomJS by hacking the JavaScriptCore and WebKit engine with the tainted signal." might make me choose phantomjs?

andresriancho commented 10 years ago

PhantomJS is a headless WebKit with JavaScript API. It can be used for headless website testing. PhantomJS has a lot of different uses. The interesting bit for me is to use PhantomJS as a lighter-weight replacement for a browser when running web acceptance tests. This enables faster testing, without a display or the overhead of full-browser startup/shutdown.

http://python.dzone.com/articles/python-testing-phantomjs
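
For reference, driving it from Python is straightforward with Selenium; a minimal sketch, assuming an older Selenium release that still ships the PhantomJS driver and a phantomjs binary on the PATH (the URL is just a placeholder):

from selenium import webdriver

# Placeholder target, not a real scan target
URL = 'http://example.com/'

driver = webdriver.PhantomJS()
try:
    driver.get(URL)
    # page_source is the DOM *after* JavaScript execution, which is
    # what a JS-aware crawler would hand to the HTML parsers
    rendered_html = driver.page_source
    print(rendered_html[:200])
finally:
    driver.quit()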

oxdef commented 10 years ago

Andres, CasperJS is a wrapper around PhantomJS which adds some "syntactic sugar".

Any overall recommendation?

I think we could use phantomjs/casperjs directly or with the help of Selenium. There is also a similar project based on Gecko: https://github.com/laurentj/slimerjs. The main problem is how to crawl a modern web app (trigger all the events which change the current state and generate HTTP requests to the server side of the web app) and build a map of the web application's states.

andresriancho commented 10 years ago

http://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python?lq=1

andresriancho commented 10 years ago

PhantomJS looks like the winner for now. Some code that I'm drafting in my head is:

# Pseudo-code: take a DOM snapshot, then for every (event, element)
# pair with a handler, restore the snapshot if a previous event
# mutated the DOM, dispatch the event and wait for it to finish
dom = browser.get_dom()
for event in EVENTS:
    for elem in dom.get_all_children():
        if not has_event_handler(elem, event):
            continue
        if has_changed(dom, browser.get_dom()):
            browser.set_dom(dom)
        browser.send_event(elem, event)
        browser.wait_until_done()

Looks simple... but it's just pseudo-code. The nice thing about it is that I'm not re-loading the whole page when one of my events changes the DOM; I just "save the DOM" and set it again. Hopefully this is possible.

Also, the has_changed function should only return true when tags were added to or removed from the DOM. Changes in attribute values or text don't matter.
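
A minimal sketch of that has_changed idea, assuming the DOM snapshots are plain HTML strings and using lxml purely for illustration:

from lxml import etree

def tag_structure(dom_snapshot):
    """Reduce an HTML snapshot to its tag structure, ignoring
    attribute values and text nodes."""
    root = etree.fromstring(dom_snapshot, parser=etree.HTMLParser())
    # Skip comments and processing instructions, whose .tag is not a string
    return [elem.tag for elem in root.iter() if isinstance(elem.tag, str)]

def has_changed(dom_before, dom_after):
    """Return True only when tags were added to or removed from the DOM."""
    return tag_structure(dom_before) != tag_structure(dom_after)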

andresriancho commented 10 years ago

We might use some of the code in gremlins.js as inspiration for our js crawler

kravietz commented 9 years ago

Having experimented a bit with JS crawling, I think there are a few things to consider:

Then there's the question of architecture - I've been successfully crawling JS apps by running w3af with spider_man and using it as a proxy for the PhantomJS engine. Here's a sample crawling script that does the job; however, it doesn't support pages that require log-in or user action.
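
A minimal sketch of that setup from Python (assuming an older Selenium release with the PhantomJS driver, spider_man listening on its default 127.0.0.1:44444 and a placeholder target URL):

from selenium import webdriver

# spider_man's listen address; 127.0.0.1:44444 is the plugin default,
# adjust it if the proxy was configured differently
PROXY = '127.0.0.1:44444'

# --proxy / --proxy-type are regular PhantomJS command line switches,
# passed through by Selenium's PhantomJS driver via service_args
driver = webdriver.PhantomJS(
    service_args=['--proxy=%s' % PROXY, '--proxy-type=http'])
try:
    # Every request PhantomJS makes while rendering now flows through
    # spider_man, which feeds it into the w3af crawl queue
    driver.get('http://target.example/')
finally:
    driver.quit()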

This scenario however doesn't differ much from just running a GUI browser and clicking through the application with spider_man running, which I believe is what most people are doing now. In such a case w3af is just passively observing the requests coming from the browser (or PhantomJS), not really crawling the website.

W3af could theoretically instrument PhantomJS but it will be challenging. PhantomJS discontinued its native Python API and while you can still use it via Selenium webdriver, it's severely limited - e.g. HTTP headers aren't accessible.

andresriancho commented 9 years ago

Thanks for confirming that phantomjs is the way to go.

Then there's the question of architecture - I've been successfully crawling JS apps by running w3af with spider_man and using it as a proxy for the PhantomJS engine. Here's a sample crawling script that does the job; however, it doesn't support pages that require log-in or user action.

Well, ideally we'll be able to produce something similar that does support credentials (via the already existing auth plugins)

A related issue which we'll be working on is the replacement of the old MITM proxy with https://github.com/andresriancho/w3af/issues/1269, which should give us a very stable and fast proxy.

This scenario however doesn't differ much from just running a GUI browser and clicking through the application with spider_man running, which I believe is what most people are doing now.

Yup, most people do that, but it's boring and non-automated.

In such a case w3af is just passively observing the requests coming from the browser (or PhantomJS), not really crawling the website.

W3af could theoretically instrument PhantomJS but it will be challenging. PhantomJS discontinued its native Python API and while you can still use it via Selenium webdriver, it's severely limited - e.g. HTTP headers aren't accessible.

Yes, we'll have to somehow instrument the phantomjs browser to do the crawling for us; it would be like migrating the "web_spider" plugin to that... we'll see how that goes. For now I'm focusing on completing some architecture refactoring/bug fixing, but I really look forward to working on this issue.

andresriancho commented 9 years ago

Yet another option, this time supported by the guys from Scrapinghub:

http://splash.readthedocs.org/en/latest/ http://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ https://github.com/scrapinghub/splash

Provides a nice REST API, which would be great since it allows me to easily integrate with w3af and run all the rendering in a completely separate process.

It's implemented in Python using Twisted and Qt.
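
Integration could then be a plain HTTP call per page; a rough sketch, assuming a Splash instance on its documented default port 8050 and a placeholder target URL:

import requests

SPLASH = 'http://127.0.0.1:8050'

# /render.html returns the DOM after JavaScript execution; `wait`
# gives the page a couple of seconds to fire its AJAX requests
response = requests.get(SPLASH + '/render.html',
                        params={'url': 'http://target.example/',
                                'wait': 2.0})

rendered_html = response.text
print(rendered_html[:200])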

andresriancho commented 7 years ago

https://drupalize.me/blog/201410/using-remote-debugger-casperjs-and-phantomjs would help debugging issues with https://github.com/yahoo/gryffin

andresriancho commented 7 years ago

Code from https://github.com/yahoo/gryffin/tree/master/renderer/resource is licensed under the BSD 3-clause license, which makes it GPL-compatible.

andresriancho commented 7 years ago

https://github.com/yahoo/gryffin/issues/33 has some issues; check this fork.

andresriancho commented 7 years ago

https://github.com/ssonder/web_spider

andresriancho commented 6 years ago

TODO

andresriancho commented 6 years ago

After https://github.com/andresriancho/w3af/commit/dcf46d4afd913a325361e09637ee67017b02a6c8, w3af will extract links and forms from Chrome-rendered DOMs. This is a great improvement for scanning sites which use JS heavily! :+1: