fcavallarin / htcap

htcap is a web application scanner able to crawl single page application (SPA) recursively by intercepting ajax calls and DOM changes.
GNU General Public License v2.0
610 stars 114 forks source link

Site using `window.location` to navigate are not crawled properly #22

Closed GuilloOme closed 7 years ago

GuilloOme commented 7 years ago

Site relying on window.location for navigation are not crawled properly.

A crawl on a page like this one will not see all the site:

<html>
    <head>
    <script>
        function go1(){
            window.location = "../1.php";
        }
        function go2(){
            window.location = "../2.php";
        }
        function go3(){
            window.location = "../3.php";
        }
    </script>
    </head>
    <body>
        <span onclick="go1()">click here </span> 
        <a onclick="go2()">click here </a> 
        <td onclick="go3()">click here </td> 
    </body>
</html>
GuilloOme commented 7 years ago

I opened a issue here: https://github.com/ariya/phantomjs/issues/14782

Update: I closed the phantomJS issue today (2016-04-03). After some research, it is an expected behavior from phantomJS/webkit. Since all the events are triggered simultaneously, phantomJS (through webkit) only navigate to the last location requested (see more detail in the phantomjs issue), then in htcap, only the last one is trapped.

GuilloOme commented 7 years ago

Here is where I am now for the resolution of this issue:

Root cause of the issue Because all the events found in the DOM are triggered back-to-back without waiting the end of the current stack to resolve, only the last window.location is visible through the onNavigationRequested called by phantomJS.

Solution found Make sure that none of the events are triggered within the same stack: this way, avoiding collision and inconsistency in the page context (the events we trigger this way are meant to be triggered by a human user so the page script could not have been engineered to handle this kind of concurrency). Technically, it means using a messageEvent queue to place it one by one in the event loop.

Difficulties The actual implementation of the JS probe is not designed to handle async logic: the usage of setInterval as a loop and global flags for managing the state of the analyzeDOM implie skipping step in the process… for more info: Concurrency model and Event Loop and this very good talk

It is possible that it will be necessary to refactor the current analyzeDOM() method.