fcavallarin / htcap

htcap is a web application scanner able to crawl single page application (SPA) recursively by intercepting ajax calls and DOM changes.
GNU General Public License v2.0

Move to headless chrome #31

Closed segment-srl closed 5 years ago

segment-srl commented 7 years ago

PhantomJS is no longer under development, so we need to move to headless Chrome.

ring04h commented 6 years ago

Are you talking about puppeteer or chromeless?

GuilloOme commented 6 years ago

Since puppeteer is the project started by the team that develops Chrome, I would be inclined to use this lib.

segment-srl commented 6 years ago

Hi, and sorry for the delayed answer... I agree with GuilloOme, puppeteer is the best choice. I made some tests with puppeteer and the crawler is working pretty well with very few modifications to htcap's JS code. I'm still facing a couple of problems:

  1. it seems not possible to load a page using a POST request (with custom headers)
  2. it seems that there is no reliable way to "lock navigation" as in phantomJS

Point 1 may be resolved by writing a Chrome extension, but point 2 is very problematic. In Chrome it's possible to intercept and abort requests, but not the page navigation. For example, if we allow the loading of scripts, it's possible that the crawler will navigate to a .js URL... Also, it's not possible to prevent navigation to about:blank (e.g. <a href="about:blank"...). I'm going to perform more tests to find out if I'm missing something...

GuilloOme commented 6 years ago

We did all this in our fork. If you want to take a look at the implementation details, they are here: https://github.com/delvelabs/htcap/tree/master/core/crawl/probe

We did a lot of work to reach a stable (enough) implementation and it will be deployed in our production environment in January.

segment-srl commented 6 years ago

I tried your fork and it seems it faces the same issue as my test code. If a page contains a link to about:blank (<a href="about:blank") the navigation is not locked.

GuilloOme commented 6 years ago

@segment-srl you are right, any "special" URI scheme makes the probe hang… we haven't found a solution yet. It should be possible to solve it through the webNavigation feature available in Chrome extensions.

We chose to postpone the issue since not many websites use schemes other than http(s) in href attributes, but it will have to be handled at some point.
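
To make the "scheme other than http(s)" case concrete, here is a minimal sketch of how such href values could be recognised. It is not taken from htcap or from the delvelabs fork, and it does not lock navigation; it only classifies a link target. The names `CRAWLABLE_SCHEMES`, `href_scheme` and `is_special_href` are hypothetical, and Python is assumed only because htcap's core is written in it.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: only these schemes are safe for the probe to navigate to.
CRAWLABLE_SCHEMES = {"http", "https"}


def href_scheme(href: str, base_url: str) -> str:
    """Return the lowercase scheme that href resolves to against base_url."""
    # Relative hrefs inherit the scheme of the page they appear on.
    # strip() mirrors how browsers ignore whitespace around href values.
    resolved = urljoin(base_url, href.strip())
    return urlsplit(resolved).scheme.lower()


def is_special_href(href: str, base_url: str) -> bool:
    """True when href resolves to a scheme other than http(s)."""
    return href_scheme(href, base_url) not in CRAWLABLE_SCHEMES
```

With this sketch, `is_special_href("about:blank", "http://example.com/")` is true, while a relative link such as `/page2` resolves to `http` and is not flagged. Whether skipping such links before the probe interacts with them would avoid the hang is untested here.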