
elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

https://hexdocs.pm/crawly

Apache License 2.0

977 stars, 115 forks

Browser Rendering? #18

Closed (chavenor closed this issue 4 years ago)

chavenor commented 5 years ago

Is in-browser rendering of HTML, CSS, and JS on the roadmap?

Thanks!

oltarasenko commented 5 years ago

@chavenor Yes, definitely!

Ziinc commented 4 years ago

This could be handled by a middleware. For the browser rendering, there are two stable options:

puppeteer (chrome only)
hound (requires Selenium server running + browser)

I'm thinking the hound route would be better

oltarasenko commented 4 years ago

I had a plan to showcase a library which we were developing in the past: https://github.com/scrapinghub/splash.

It could serve as a first option (no development required; just use splash as a proxy). In the long term, however, I would love to support routing requests through headless browsers (I think browsers can now be controlled directly via an HTTP API, without Selenium).

Selenium has limitations around setting HTTP request headers, etc. Still thinking about this.
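
For readers unfamiliar with the "splash as a proxy" idea, here is a minimal sketch in Python, not taken from Crawly itself. It assumes a Splash instance listening on its default port 8050 and uses Splash's `render.html` endpoint with the `url` and `wait` parameters. The helper name `splash_render_url` is hypothetical. Instead of fetching the target page directly, the crawler would fetch the rewritten URL and receive the HTML after JavaScript has run.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a Splash render.html URL for target_url.

    Assumptions: splash_base points at a running Splash instance
    (8050 is Splash's default port); `wait` is the number of seconds
    Splash waits after page load before returning the rendered HTML.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # A crawler would request this URL instead of the original page.
    print(splash_render_url("https://example.com/page?id=1"))
```

Because the rewrite is just a URL transformation, any HTTP client can use it without new dependencies, which is why it counts as the "no development required" option.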

Ziinc commented 4 years ago

The usual solution for modifying HTTP headers is to go through the BrowserMob proxy:

https://github.com/lightbody/browsermob-proxy

Using splash would require maintaining an Elixir wrapper for its HTTP API, which would be beyond the scope of crawly as a crawling engine.

Using hound would leverage their existing API wrapper instead.

If the goal is simply to render pages, then splash might be a good choice. If additional things like closing modals or interacting with the page are necessary, then hound might be a better choice.

Ziinc commented 4 years ago

Merging issue into #27

oltarasenko commented 4 years ago

@Ziinc I want to discuss this again. Splash is not a full-featured replacement for a browser-based request system (it's a JS renderer).

We need to work on supporting something like a headless Chrome client. It will be required for targets that ban clients based on the fingerprint of their HTTP header strings.

oltarasenko commented 4 years ago

@chavenor I would assume that with the basic splash renderer this can be closed. Of course, we will have to keep working towards headless Chrome. For now, however, I don't see immediate demand or requests for that feature.