segment-srl closed this issue 5 years ago
Are you talking about puppeteer or chromeless?
Since puppeteer is the project started by the team that develops Chrome, I would be inclined to use that lib.
Hi, and sorry for the delayed answer... I agree with Gullohome, puppeteer is the best choice. I ran some tests with puppeteer and the crawler works pretty well with very few modifications to htcap's JS code. I'm still facing a couple of problems:
Point 1 may be resolved by writing a Chrome extension, but point 2 is very problematic. In Chrome it's possible to intercept and abort requests, but not page navigation. For example, if we allow scripts to load, the crawler may navigate to a .js URL... It's also not possible to prevent navigation to about:blank (e.g. <a href="about:blank"...). I'm going to run more tests to find out if I'm missing something...
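For readers following along, here is a rough sketch of the request interception mentioned above, using puppeteer's `page.setRequestInterception`, `request.resourceType()`, `request.abort()` and `request.continue()`. The helper name `installRequestFilter` and the set of blocked resource types are assumptions made for illustration; they are not taken from htcap. As the comment above notes, this kind of filter works on requests and is not reported to lock page navigation.

```javascript
// Sketch only: `installRequestFilter` and BLOCKED_TYPES are hypothetical,
// not part of htcap or puppeteer.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// `page` is expected to be a puppeteer Page (or anything with the same
// `on` / `setRequestInterception` methods).
function installRequestFilter(page, blockedTypes = BLOCKED_TYPES) {
  page.on('request', (request) => {
    // Abort requests whose resource type is blocked, let the rest through.
    const action = blockedTypes.has(request.resourceType())
      ? request.abort()
      : request.continue();
    // In puppeteer both calls return promises; ignore late failures
    // (for example when the page is already closed).
    Promise.resolve(action).catch(() => {});
  });
  // Interception must be switched on, otherwise abort()/continue() throw.
  return page.setRequestInterception(true);
}
```

With a real browser this would be called once per page, right after `browser.newPage()` and before `page.goto(...)`.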
We did all this in our fork. If you want to take a look at the implementation details, they are here: https://github.com/delvelabs/htcap/tree/master/core/crawl/probe
We did a lot of work to reach a stable (enough) implementation and it will be deployed in our production environment in January.
I tried your fork and it seems it faces the same issue as my test code. If a page contains a link to about:blank (<a href="about:blank") the navigation is not locked.
@segment-srl you are right, any "special" URI scheme makes the probe hang… we haven't found a solution yet. It should be possible to solve it through the webNavigation feature available in Chrome extensions. We chose to postpone the issue since not many websites use schemes other than http(s) in href attributes, but it has to be handled at some point.
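Until a proper fix exists, one possible stopgap (an assumption on my part, not what the fork does) would be to check each href's scheme before the probe follows it, and skip anything that is not http(s). The helper `hasCrawlableScheme` below is hypothetical; it relies only on the standard `URL` class, which is available globally in Node and in the browser.

```javascript
// Sketch only: `hasCrawlableScheme` is a hypothetical helper, not htcap code.
// Returns true when `href`, resolved against `baseUrl`, uses http or https.
function hasCrawlableScheme(href, baseUrl) {
  let parsed;
  try {
    // Relative links are resolved against the page URL; absolute links
    // such as about:blank or mailto: keep their own scheme.
    parsed = new URL(href, baseUrl);
  } catch (err) {
    return false; // unparsable href: treat as not crawlable
  }
  return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

const base = 'http://example.com/index.html';
console.log(hasCrawlableScheme('/login', base));       // true
console.log(hasCrawlableScheme('about:blank', base));  // false
```

This only avoids following such links deliberately; it would not stop a page script from navigating to them on its own.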
PhantomJS is no longer under development, so we need to move to headless Chrome.