Open vdusek opened 7 months ago
Hmm, I think that since the Playwright integration doesn't support a proxy per request, only a proxy per browser context, the correct implementation would probably be to rotate browser contexts with proxies for Playwright-enabled requests as part of ApifyHttpProxyMiddleware
🤔 That's hard to implement on my own as part of the spider code.
Note: If this template ever exists, it should contain playwright install --with-deps somewhere in the Dockerfile. This has just bitten me.
So obviously I have no idea what I'm doing, but today I invented this and it seems like it could be working. It's hard to verify, but it looks like I might be successfully sending Playwright requests over Apify proxy. This is how I override Apify settings:
...
settings = apply_apify_settings(settings=settings, proxy_config=proxy_config)
# use custom proxy middleware
priority = settings["DOWNLOADER_MIDDLEWARES"].pop(
    "apify.scrapy.middlewares.ApifyHttpProxyMiddleware"
)
settings["DOWNLOADER_MIDDLEWARES"][
    "jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware"
] = priority
...
And this is the actual implementation of my custom middleware:
import hashlib
from typing import Self  # Python 3.11+; use typing_extensions on older versions

from apify import Actor
from apify.scrapy.middlewares import ApifyHttpProxyMiddleware
from scrapy import Request, Spider
from scrapy.crawler import Crawler


class PlaywrightApifyHttpProxyMiddleware(ApifyHttpProxyMiddleware):
    @classmethod
    def from_crawler(cls, crawler: Crawler) -> Self:
        Actor.log.info("Using customized ApifyHttpProxyMiddleware.")
        return cls(super().from_crawler(crawler)._proxy_settings)

    async def process_request(self, request: Request, spider: Spider):
        if request.meta.get("playwright"):
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: playwright=True, request={request}, spider={spider}"
            )
            url = await self._get_new_proxy_url()
            if not (url.username and url.password):
                raise ValueError(
                    "Username and password must be provided in the proxy URL."
                )
            # Playwright takes a proxy per browser context, so name the
            # context after a short hash of the proxy URL.
            proxy = url.geturl()
            proxy_hash = hashlib.sha1(proxy.encode()).hexdigest()[0:8]
            context_name = f"proxy_{proxy_hash}"
            Actor.log.info(f"Using Playwright context {context_name}")
            request.meta.update(
                {
                    # context_name already carries the "proxy_" prefix
                    "playwright_context": context_name,
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": proxy,
                            "username": url.username,
                            "password": url.password,
                        },
                    },
                }
            )
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: updated request.meta={request.meta}"
            )
        else:
            await super().process_request(request, spider)
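To make the context-per-proxy idea easier to check in isolation, here is a minimal sketch of just the meta-building step, with no Scrapy or Apify imports. The function name playwright_meta_for_proxy and the example proxy URL are hypothetical. Unlike the middleware above, this sketch passes the server without the embedded credentials, which is an assumption about what is cleaner rather than something verified against Apify proxy.

```python
import hashlib
from urllib.parse import urlsplit


def playwright_meta_for_proxy(proxy_url: str) -> dict:
    """Build request.meta entries that pin a request to a per-proxy browser context."""
    url = urlsplit(proxy_url)
    if not (url.username and url.password):
        raise ValueError("Username and password must be provided in the proxy URL.")

    # The same proxy URL always hashes to the same context name,
    # so requests sharing a proxy also share a browser context.
    proxy_hash = hashlib.sha1(proxy_url.encode()).hexdigest()[:8]

    # Assumption: hand over the server without credentials and pass
    # the credentials in their own fields.
    server = f"{url.scheme}://{url.hostname}"
    if url.port:
        server += f":{url.port}"

    return {
        "playwright_context": f"proxy_{proxy_hash}",
        "playwright_context_kwargs": {
            "proxy": {
                "server": server,
                "username": url.username,
                "password": url.password,
            },
        },
    }


meta = playwright_meta_for_proxy("http://user:secret@proxy.example.com:8000")
print(meta["playwright_context_kwargs"]["proxy"]["server"])  # http://proxy.example.com:8000
```

Since the context name is derived from the whole proxy URL, every distinct proxy URL (for example, a different session in the username) creates a new browser context, so it may be worth watching how many contexts pile up during a long crawl.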
I'll see over the coming days whether it performs reasonably. Also, FWIW, adding playwright install --with-deps to my Dockerfile has made my builds take quite a while to finish. If you know about a more efficient approach, that would be awesome:
RUN echo "Python version:" \
    && python --version \
    && echo "Pip version:" \
    && pip --version \
    && echo "Installing Poetry:" \
    && pip install --no-cache-dir poetry~=1.7.1 \
    && echo "Installing dependencies:" \
    && poetry config cache-dir /tmp/.poetry-cache \
    && poetry config virtualenvs.in-project true \
    && poetry install --only=main --no-interaction --no-ansi \
    && rm -rf /tmp/.poetry-cache \
    && echo "All installed Python packages:" \
    && pip freeze \
    && echo "Installing Playwright dependencies:" \
    && poetry run playwright install firefox --with-deps
scrapy-playwright provides a Scrapy component, ScrapyPlaywrightDownloadHandler, which needs to be added to the project.
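For reference, this is roughly what adding the handler looks like, going by the scrapy-playwright README: the download handler is registered for both schemes and the asyncio Twisted reactor is selected. The names PLAYWRIGHT_SETTINGS and with_playwright are hypothetical helpers for this sketch, and firefox is chosen only because the Dockerfile above installs it.

```python
# Settings scrapy-playwright expects, kept in a plain dict for illustration.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright needs the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    # Assumption: match the browser installed in the Docker image.
    "PLAYWRIGHT_BROWSER_TYPE": "firefox",
}


def with_playwright(settings: dict) -> dict:
    """Return a copy of a settings dict with the Playwright entries merged in."""
    merged = dict(settings)
    handlers = dict(merged.get("DOWNLOAD_HANDLERS", {}))
    handlers.update(PLAYWRIGHT_SETTINGS["DOWNLOAD_HANDLERS"])
    merged["DOWNLOAD_HANDLERS"] = handlers
    merged["TWISTED_REACTOR"] = PLAYWRIGHT_SETTINGS["TWISTED_REACTOR"]
    merged["PLAYWRIGHT_BROWSER_TYPE"] = PLAYWRIGHT_SETTINGS["PLAYWRIGHT_BROWSER_TYPE"]
    return merged


project_settings = with_playwright({"BOT_NAME": "example"})
```

In a real project these keys would normally go straight into settings.py; the merge helper only shows which keys are involved and leaves any other existing download handlers untouched.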