Open andresriancho opened 10 years ago
My slides from J4M 2012: http://yadi.sk/d/7VBHg0n9LXxsz https://svn.code.sf.net/p/w3af/code/branches/webapps/
Thanks for the slides! Will be helpful for this task.
Any overall recommendation?
https://github.com/andresriancho/w3af/commits/webapps mirrors https://svn.code.sf.net/p/w3af/code/branches/webapps/
PhantomJS is a headless WebKit with JavaScript API. It can be used for headless website testing. PhantomJS has a lot of different uses. The interesting bit for me is to use PhantomJS as a lighter-weight replacement for a browser when running web acceptance tests. This enables faster testing, without a display or the overhead of full-browser startup/shutdown.
Andres, CasperJS is a wrapper around PhantomJS which adds some "syntactic sugar".
Any overall recommendation?
I think we could use phantomjs/casperjs directly or with the help of Selenium. There is also a similar project based on Gecko: https://github.com/laurentj/slimerjs. The main problem is how to crawl a modern web app (trigger all the events which change the current state and generate HTTP requests to the server side of the web app) and build a map of the application's states.
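The "map of states" idea above can be sketched as a small graph structure: nodes are DOM snapshots (keyed by a hash) and edges record which (element, event) pair caused the transition. This is only an illustrative model, not w3af code; all names here are hypothetical.

```python
import hashlib


class StateMap:
    """Toy model of a web app state graph: nodes are DOM snapshots
    keyed by their hash, edges are (element, event) pairs that
    triggered a transition between two states."""

    def __init__(self):
        self.states = {}    # dom_hash -> dom string
        self.edges = set()  # (from_hash, element, event, to_hash)

    @staticmethod
    def dom_hash(dom):
        return hashlib.sha1(dom.encode('utf-8')).hexdigest()

    def add_state(self, dom):
        h = self.dom_hash(dom)
        self.states.setdefault(h, dom)
        return h

    def add_transition(self, from_dom, element, event, to_dom):
        # Duplicate transitions collapse into one edge because
        # self.edges is a set
        edge = (self.add_state(from_dom), element, event,
                self.add_state(to_dom))
        self.edges.add(edge)
        return edge
```

Deduplicating states by DOM hash is what keeps the crawl finite: firing the same event on the same element twice should not grow the map.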
PhantomJS looks like the winner for now, some code that I'm drafting in my head is:
dom = browser.get_dom()

for event in EVENTS:
    for elem in dom.get_all_children():
        if not has_event_handler(elem, event):
            continue

        browser.send_event(elem, event)
        browser.wait_until_done()

        if has_changed(dom, browser.get_dom()):
            browser.set_dom(dom)
Looks simple... but it's just pseudo-code. The nice thing about it is that I'm not re-loading the whole page when one of my events changes the DOM, I just "save the DOM" and set it again. Hopefully this is possible.
Also, the has_changed function should only return True when tags were added/removed from the DOM. Changes in values for attrs/text don't matter.
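A minimal sketch of that has_changed rule, using only the standard library's html.parser: compare just the sequence of tag names, so attribute and text changes are ignored. This is an assumption about how the comparison could work, not w3af's actual implementation.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records only the sequence of tag names; attribute values are
    dropped and text is ignored (handle_data is not overridden)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_structure(dom):
    collector = TagCollector()
    collector.feed(dom)
    return collector.tags


def has_changed(dom_before, dom_after):
    # True only when tags were added/removed, never for changes in
    # attribute values or text content
    return tag_structure(dom_before) != tag_structure(dom_after)
```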
We might use some of the code in gremlins.js as inspiration for our js crawler
Having experimented a bit with JS crawling there are a few things to consider:
Then there's the question of architecture - I've been successfully crawling JS apps by running w3af with spider_man and pointing the PhantomJS engine at it as a proxy. Here's a sample crawling script that does the job; however, it doesn't support pages that require log-in or user action.
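Wiring PhantomJS to the spider_man proxy boils down to launching it with the right flags. A sketch of building that invocation from Python follows; `--proxy`, `--proxy-type` and `--ignore-ssl-errors` are real PhantomJS command-line switches, while `crawl.js` and the proxy address are placeholders (44444 is spider_man's default port, but check your configuration).

```python
import subprocess

SPIDER_MAN_PROXY = '127.0.0.1:44444'  # spider_man's default listen address


def phantomjs_command(script, proxy=SPIDER_MAN_PROXY):
    """Build a phantomjs invocation that routes all HTTP(S) traffic
    through the spider_man proxy so w3af sees every request."""
    return ['phantomjs',
            '--proxy=%s' % proxy,
            '--proxy-type=http',
            '--ignore-ssl-errors=true',
            script]


# Requires phantomjs on the PATH; 'crawl.js' is a stand-in for the
# actual crawling script:
# subprocess.call(phantomjs_command('crawl.js'))
```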
This scenario however doesn't differ much from just running a GUI browser and clicking through the application with spider_man running, which I believe most people are doing now. In that case w3af is just passively observing the requests coming from the browser (or PhantomJS), not really crawling the website.
W3af could theoretically instrument PhantomJS but it will be challenging. PhantomJS discontinued its native Python API and while you can still use it via Selenium webdriver, it's severely limited - e.g. HTTP headers aren't accessible.
Thanks for confirming that phantomjs is the way to go.
Then there's the question of architecture - I've been successfully crawling JS apps by running w3af with spider_man and pointing the PhantomJS engine at it as a proxy. Here's a sample crawling script that does the job; however, it doesn't support pages that require log-in or user action.
Well, ideally we'll be able to produce something similar that does support credentials (via the already existing auth plugins)
A related issue which we'll be working on is the replacement of the old MITM proxy with https://github.com/andresriancho/w3af/issues/1269 , which should allow us to have a very stable and fast proxy
This scenario however doesn't differ much from just running a GUI browser and clicking through the application with spider_man running, which I believe most people are doing now
Yup, most people do that, but it's boring and non-automated
In such case w3af is just passively observing the requests coming from the browser (or PhantomJS), not really crawling the website.
W3af could theoretically instrument PhantomJS but it will be challenging. PhantomJS discontinued its native Python API and while you can still use it via Selenium webdriver, it's severely limited - e.g. HTTP headers aren't accessible.
Yes, we'll have to somehow instrument the phantomjs browser to do the crawling for us, it would be like migrating the "web_spider" plugin to that... we'll see how that goes. For now I'm focusing on completing some architecture refactoring/bug fixing, but I really look forward to working on this issue
Yet another option, this time supported by the guys from Scrapinghub:
http://splash.readthedocs.org/en/latest/ http://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ https://github.com/scrapinghub/splash
Provides a nice REST API, which would be nice to have since it allows me to easily integrate with w3af and run it all in a completely separate process.
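To illustrate how thin that integration could be: Splash's documented render.html endpoint returns the JS-rendered HTML of a page over plain HTTP. A minimal sketch, assuming a Splash instance on its default port (8050); the helper name is made up for the example.

```python
from urllib.parse import urlencode

SPLASH = 'http://127.0.0.1:8050'  # Splash's default port


def splash_render_url(target, wait=2.0):
    """Build a GET request URL for Splash's render.html endpoint,
    which returns the JavaScript-rendered HTML of `target` after
    waiting `wait` seconds for scripts to run."""
    return '%s/render.html?%s' % (SPLASH,
                                  urlencode({'url': target, 'wait': wait}))


# With Splash running, fetching the rendered DOM is a single call:
# import urllib.request
# html = urllib.request.urlopen(splash_render_url('http://example.com')).read()
```

Because it's just HTTP, the rendering engine crashes or hangs in its own process and never takes the scanner down with it.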
It's implemented in Python using Twisted and Qt.
https://drupalize.me/blog/201410/using-remote-debugger-casperjs-and-phantomjs would help with debugging issues with https://github.com/yahoo/gryffin
Code from https://github.com/yahoo/gryffin/tree/master/renderer/resource is licensed under BSD 3-clause which makes it GPL-compatible
https://github.com/yahoo/gryffin/issues/33 has some issues, check this fork
PyChromeDevTools was not available in pypi when I started the development; now it is, and I should reference it appropriately in requirements.py. See details here.

After https://github.com/andresriancho/w3af/commit/dcf46d4afd913a325361e09637ee67017b02a6c8 , w3af will extract links and forms from Chrome-rendered DOMs. This is a great improvement for scanning sites which use JS heavily! :+1:
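Once Chrome hands back the rendered DOM as a string, pulling out links and forms is plain HTML parsing. A standard-library sketch of that extraction step (illustrative only; w3af's real extractors are more thorough):

```python
from html.parser import HTMLParser


class LinkFormExtractor(HTMLParser):
    """Collects <a href> targets and <form action> targets from an
    already-rendered DOM string, e.g. the HTML Chrome returns after
    executing the page's JavaScript."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'form':
            # A missing action means the form posts to the current URL
            self.forms.append(attrs.get('action', ''))


def extract(rendered_dom):
    parser = LinkFormExtractor()
    parser.feed(rendered_dom)
    return parser.links, parser.forms
```

The key point is that the parser runs on the *post-JS* DOM, so links and forms injected by JavaScript are visible even though they never appear in the raw HTTP response body.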
User story
As a user I would like to be able to scan sites which are heavily based on JavaScript.
Research
Architecture and implementation plan
Javascript crawler - Architecture and Implementation plan
Conditions of satisfaction
diskCachePath
Potential browsers to use
One of the most important things to take into account when choosing a JS engine is how easy it will be for the users to install it on their workstations.
Browser
I've finally decided to use phantomjs, which seems to be the most supported, used and stable project out there.
PhantomJS add-ons
We should use it too, I'm sure it's something simple but at least it's one less thing we need to think about