Closed chavenor closed 4 years ago
@chavenor Yes, definitely!
This could be handled by a middleware. For browser rendering, there are two stable options:
I'm thinking the hound route would be better
I had been planning to showcase a library we developed in the past: https://github.com/scrapinghub/splash.
It could be used as a first option (no development required; just use Splash as a proxy). However, in the long term I would love to support routing requests through headless browsers (I think browsers can now be controlled via an HTTP API directly, without Selenium).
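As a minimal sketch of the "Splash as a proxy" option: Splash exposes an HTTP API with a `render.html` endpoint that returns the JS-rendered page. The snippet below assumes a local Splash instance on port 8050; the helper names and the example URL are illustrative, not part of Crawly.

```python
import urllib.parse
import urllib.request

SPLASH = "http://localhost:8050"  # assumed local Splash instance

def build_render_url(url, wait=0.5):
    """Build the Splash render.html URL for a target page (wait = seconds for JS)."""
    qs = urllib.parse.urlencode({"url": url, "wait": wait})
    return f"{SPLASH}/render.html?{qs}"

def render(url, wait=0.5):
    """Ask Splash to render the page and return the resulting HTML as a string."""
    with urllib.request.urlopen(build_render_url(url, wait)) as resp:
        return resp.read().decode("utf-8")
```

In a crawler, a request middleware would rewrite outgoing request URLs through `build_render_url` so responses come back already rendered.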
The usual solution for modifying HTTP headers is the BrowserMob Proxy:
https://github.com/lightbody/browsermob-proxy
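To the best of my understanding of BrowserMob Proxy's REST API (a proxy port is created with `POST /proxy`, and request headers are overridden with `PUT /proxy/<port>/headers`), header modification could be scripted as below. The REST endpoint address, the port number, and the function names are assumptions for illustration.

```python
import json
import urllib.request

BMP = "http://localhost:8080"  # assumed BrowserMob Proxy REST endpoint

def header_override_request(port, headers):
    """Build the (url, body) pair for BrowserMob's PUT /proxy/<port>/headers call."""
    return (f"{BMP}/proxy/{port}/headers", json.dumps(headers))

def set_request_headers(port, headers):
    """Override request headers on an already-created proxy port."""
    url, body = header_override_request(port, headers)
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req).close()
```

The crawler (or browser) would then be pointed at `localhost:<port>`, and every outgoing request through that proxy would carry the overridden headers.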
Using Splash would require maintaining an Elixir wrapper for its HTTP API, which would be beyond the scope of Crawly as a crawling engine.
Using Hound would leverage its existing API wrapper instead.
If the goal is simply to render, then Splash might be a good choice. If additional interaction with the page is needed, such as closing modals or clicking elements, then Hound might be the better choice.
Merging issue into #27
@Ziinc I want to discuss this again. Splash is not a full-featured replacement for a browser-based request system; it's just a JS renderer.
We need to work on support for something like a headless Chrome client. It will be required for targets that ban clients based on fingerprinting of HTTP header strings.
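For reference, controlling headless Chrome without Selenium is possible through the DevTools Protocol: Chrome launched with `--headless --remote-debugging-port=9222` serves an HTTP endpoint listing debuggable targets, each carrying a `webSocketDebuggerUrl` for full CDP control. A rough sketch (the port and function names are assumptions, and a WebSocket client would still be needed for the actual control channel):

```python
import json
import urllib.request

CDP = "http://localhost:9222"  # Chrome started with --headless --remote-debugging-port=9222

def page_targets(targets):
    """Keep only 'page' targets from a DevTools target list."""
    return [t for t in targets if t.get("type") == "page"]

def list_page_targets():
    """Fetch the live target list from Chrome's DevTools HTTP endpoint."""
    with urllib.request.urlopen(f"{CDP}/json/list") as resp:
        return page_targets(json.load(resp))
```

Because the control channel is plain HTTP plus WebSocket JSON messages, an Elixir client could speak it directly, which is presumably what "controlled via an HTTP API without Selenium" means here.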
@chavenor I would assume that with the basic Splash renderer in place, this can be closed. Of course, we should continue towards headless Chrome support; however, for now I don't see immediate demand or requests for that feature.
Is HTML, CSS, JS in-browser rendering on the roadmap?
Thanks!