elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

`Crawly.Fetchers.Fetcher` implementation for Playwright #246

Closed: Nezteb closed this issue 6 months ago

Nezteb commented 1 year ago

Currently crawly has an implementation for Splash: https://github.com/elixir-crawly/crawly/blob/5eeeb2a3ba230ee55d2411a64f9e426957dc8c40/lib/crawly/fetchers/splash.ex

I tend to use Playwright (or Puppeteer if I only care about Chromium) for browser automation and testing, so it'd be cool to be able to use some of its functionality from crawly.

The only thing I'm unsure of is whether or not Playwright exposes a requests page/API like Splash does:

> Splash exposes the render.html endpoint which renders incoming requests sent with ?url get parameter.
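For anyone unfamiliar with the Splash side, here is a rough sketch of what "sent with ?url get parameter" amounts to: the page to render is passed as a query parameter on the `render.html` endpoint. The module name `SplashUrlSketch`, its `build/3` function, and the `localhost:8050` address are assumptions made up for illustration and are not part of Crawly. Only Elixir's standard `URI` module is used.

```elixir
defmodule SplashUrlSketch do
  @moduledoc "Hypothetical helper (not part of Crawly): builds a Splash render.html URL."

  # base_url:     the Splash render.html endpoint (address assumed by the caller).
  # target_url:   the page Splash should load and render.
  # extra_params: optional keyword list of further query parameters.
  def build(base_url, target_url, extra_params \\ []) do
    # Splash reads the page to render from the `url` GET parameter,
    # so the target URL has to be percent-encoded into the query string.
    query = URI.encode_query([{:url, target_url} | extra_params])
    base_url <> "?" <> query
  end
end

# Assumes a Splash instance listening on localhost:8050.
IO.puts(SplashUrlSketch.build("http://localhost:8050/render.html", "https://example.com/"))
# prints http://localhost:8050/render.html?url=https%3A%2F%2Fexample.com%2F
```

A Playwright-based fetcher would presumably need a different approach, since whether Playwright offers a comparable HTTP endpoint is exactly the open question here.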

I might end up picking this up, but I figured I'd create an issue beforehand. 😄

oltarasenko commented 1 year ago

Hard to say. I have not had a chance to explore these two tools. In some of my previous projects, PhantomJS was used for browser rendering, but now it seems to be a bit dead.

It would be interesting to see an example fetcher for Playwright or Puppeteer. Maybe we can add it to Crawly as a standard fetcher :) Just let me know how it goes!

Nezteb commented 1 year ago

As a non-Elixir example, I just built a scraper for sites that will save each page as a PDF using Playwright: https://github.com/Nezteb/scrape-pdf

Next weekend I'll see what I can do about a crawly fetcher for it!

dbrody commented 1 year ago

https://github.com/mechanical-orchard/playwright-elixir will probably be able to support what you are looking for.

Nezteb commented 1 year ago

> mechanical-orchard/playwright-elixir

Oh nice, I'll check that out! I'll see if I can get a minimal demo of using crawly along with playwright-elixir as the fetcher!