Add new Python template - Scrapy & Playwright

vdusek commented 7 months ago

Some "JavaScript-heavy websites" (e.g. https://tripadvisor.com) cannot be scraped by using just Scrapy.

Can you check why our Beautiful Soup template fails on tripadvisor.com? https://console.apify.com/actors/jWYbXHu32SvZf1Cgb/runs/0IYh4rWH9Ig2vIUSM#output

Solution: We can provide a new Scrapy Actor template using a headless browser like Playwright.
PyPI packages: scrapy and scrapy-playwright.
Integrating Playwright into a Scrapy project is pretty simple: scrapy-playwright provides a Scrapy component, ScrapyPlaywrightDownloadHandler, which needs to be added to the project.
Check the Web scraping with Scrapy blog post for more information and inspiration.
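
To illustrate what adding the handler involves: as far as I know, scrapy-playwright expects the download handler to be registered for both URL schemes and Scrapy to run on the asyncio reactor. A minimal sketch under that assumption; the helper name `playwright_settings` is hypothetical and the browser choice is only an example.

```python
def playwright_settings(browser_type: str = "chromium") -> dict:
    """Return Scrapy settings that route downloads through scrapy-playwright.

    `playwright_settings` is a hypothetical helper name, not part of any library.
    """
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        # Register the Playwright download handler for both schemes.
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        # scrapy-playwright needs Twisted to run on top of asyncio.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # Which Playwright browser to launch (chromium, firefox, or webkit).
        "PLAYWRIGHT_BROWSER_TYPE": browser_type,
    }
```

With these settings in place, a request still only goes through the browser if it opts in with `meta={"playwright": True}`; other requests use Scrapy's regular downloader.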

honzajavorek commented 3 months ago

I see the main challenge in setting PLAYWRIGHT_LAUNCH_OPTIONS correctly to respect APIFY_PROXY_SETTINGS (docs). Or maybe passing it like this, not sure.
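
For reference, a rough sketch of the first option: turning a single proxy URL into a value for PLAYWRIGHT_LAUNCH_OPTIONS. It assumes the proxy URL is already available as a string (how it is obtained from the Apify proxy configuration is left out), and the helper name and the example URL are hypothetical. Playwright's launch `proxy` option takes `server`, `username`, and `password` as separate fields, so the credentials are split out of the URL here.

```python
from urllib.parse import urlsplit


def launch_options_from_proxy_url(proxy_url: str) -> dict:
    """Build a PLAYWRIGHT_LAUNCH_OPTIONS value from one proxy URL (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError("Proxy URL must include a host.")

    # Rebuild the server address without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    proxy = {"server": server}
    if parts.username:
        proxy["username"] = parts.username
    if parts.password:
        proxy["password"] = parts.password
    return {"proxy": proxy}


# Placeholder URL for illustration only.
options = launch_options_from_proxy_url("http://user:secret@proxy.example.com:8000")
```

Note that launch options apply to the whole browser, so this approach fixes one proxy for the entire run and does not rotate.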

honzajavorek commented 3 months ago

Hmm, I think that since the Playwright integration doesn't support a proxy per request, only a proxy per browser context, the correct implementation would probably be to rotate browser contexts with proxies for Playwright-enabled requests as part of ApifyHttpProxyMiddleware 🤔 That's hard to implement on my own as part of the spider code.

honzajavorek commented 3 months ago

Note: If this template ever exists, it should contain playwright install --with-deps somewhere in the Dockerfile. This has just bitten me.

honzajavorek commented 3 months ago

So obviously I have no idea what I'm doing, but today I invented this and it seems like it could be working. It's hard to verify, but it looks like I might be successfully sending Playwright requests over Apify proxy. This is how I override Apify settings:

...
settings = apply_apify_settings(settings=settings, proxy_config=proxy_config)

# use custom proxy middleware
priority = settings["DOWNLOADER_MIDDLEWARES"].pop(
    "apify.scrapy.middlewares.ApifyHttpProxyMiddleware"
)
settings["DOWNLOADER_MIDDLEWARES"][
    "jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware"
] = priority
...
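
The same swap can be written as a small pure function, which makes it easier to test in isolation and avoids a KeyError if the original middleware is not registered. This is only a sketch: `swap_middleware` and `fallback_priority` are hypothetical names, and the paths and numbers in the example are placeholders.

```python
def swap_middleware(
    middlewares: dict, old_path: str, new_path: str, fallback_priority: int
) -> dict:
    """Return a copy of `middlewares` with `old_path` replaced by `new_path`.

    The replacement inherits the priority of the middleware it replaces.
    If `old_path` is not registered, `fallback_priority` is used instead.
    """
    updated = dict(middlewares)  # do not mutate the caller's dict
    priority = updated.pop(old_path, fallback_priority)
    updated[new_path] = priority
    return updated


# Placeholder paths and priorities for illustration only.
example = swap_middleware(
    {"pkg.OldProxyMiddleware": 950, "pkg.OtherMiddleware": 100},
    "pkg.OldProxyMiddleware",
    "pkg.NewProxyMiddleware",
    fallback_priority=750,
)
```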

And this is the actual implementation of my custom middleware:

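import hashlib
from typing import Self  # requires Python 3.11+

from apify import Actor
from apify.scrapy.middlewares import ApifyHttpProxyMiddleware
from scrapy import Request, Spider
from scrapy.crawler import Crawler


class PlaywrightApifyHttpProxyMiddleware(ApifyHttpProxyMiddleware):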
    @classmethod
    def from_crawler(cls, crawler: Crawler) -> Self:
        Actor.log.info("Using customized ApifyHttpProxyMiddleware.")
        return cls(super().from_crawler(crawler)._proxy_settings)

    async def process_request(self, request: Request, spider: Spider):
        if request.meta.get("playwright"):
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: playwright=True, request={request}, spider={spider}"
            )
            url = await self._get_new_proxy_url()

            if not (url.username and url.password):
                raise ValueError(
                    "Username and password must be provided in the proxy URL."
                )

            proxy = url.geturl()
            proxy_hash = hashlib.sha1(proxy.encode()).hexdigest()[0:8]
            context_name = f"proxy_{proxy_hash}"
            Actor.log.info(f"Using Playwright context {context_name}")
            request.meta.update(
                {
                    # context_name already carries the "proxy_" prefix,
                    # so this now matches the name logged above
                    "playwright_context": context_name,
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": proxy,
                            "username": url.username,
                            "password": url.password,
                        },
                    },
                }
            )
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: updated request.meta={request.meta}"
            )
        else:
            await super().process_request(request, spider)
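
One detail in the middleware above that may be worth double-checking: `url.geturl()` keeps the credentials inside the `server` value, while `username` and `password` are also passed separately. Whether that causes problems is not established in this thread. The sketch below assumes it is safer to strip them and shows the meta-building step as a standalone function; `playwright_proxy_meta` is a hypothetical name and the URLs are placeholders.

```python
import hashlib
from urllib.parse import urlsplit


def playwright_proxy_meta(proxy_url: str) -> dict:
    """Build request.meta entries that bind a request to a per-proxy context."""
    parts = urlsplit(proxy_url)
    if not (parts.username and parts.password):
        raise ValueError("Username and password must be provided in the proxy URL.")

    # Server address without credentials (assumption: Playwright gets them separately).
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    # Same proxy URL -> same context name, so contexts are reused per proxy.
    proxy_hash = hashlib.sha1(proxy_url.encode()).hexdigest()[:8]
    return {
        "playwright_context": f"proxy_{proxy_hash}",
        "playwright_context_kwargs": {
            "proxy": {
                "server": server,
                "username": parts.username,
                "password": parts.password,
            },
        },
    }


# Placeholder URLs for illustration only.
meta_a = playwright_proxy_meta("http://session-1:secret@proxy.example.com:8000")
meta_b = playwright_proxy_meta("http://session-2:secret@proxy.example.com:8000")
```

Because the context name is derived from the full proxy URL, two different sessions end up in two different browser contexts, while repeated requests through the same proxy share one.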

I'll see over the following days whether it performs reasonably. Also, FWIW, adding playwright install --with-deps to my Dockerfile has made my builds take quite a while to finish. If you know of a more efficient approach, that would be awesome:

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install firefox --with-deps

apify / actor-templates

Add new Python template - Scrapy & Playwright #252