apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

HTTP API for Spider #295

Closed Ehsan-U closed 1 month ago

Ehsan-U commented 3 months ago

Scrapy has a third-party companion library, ScrapyRT, which exposes spiders through an HTTP API. By sending a request to ScrapyRT with a spider name and a URL, you receive the items the spider collected from that URL.

It would be great if Crawlee could provide similar functionality out of the box. It's like turning a website into a real-time API.
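For context, the ScrapyRT workflow described above boils down to a single GET request. A minimal sketch of building such a request with the standard library, assuming ScrapyRT's default port (9080), its documented crawl.json endpoint, and a hypothetical spider named "quotes":

```python
from urllib.parse import urlencode

# Assumed ScrapyRT endpoint; 9080 is its documented default port.
SCRAPYRT_BASE = "http://localhost:9080/crawl.json"

def build_scrapyrt_url(spider_name: str, url: str) -> str:
    """Build the GET URL that ScrapyRT expects for a one-off crawl."""
    return f"{SCRAPYRT_BASE}?{urlencode({'spider_name': spider_name, 'url': url})}"

# Fetching this URL (once ScrapyRT is running) returns a JSON body whose
# "items" key holds the data the spider scraped from the given URL.
request_url = build_scrapyrt_url("quotes", "http://quotes.toscrape.com/")
```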

janbuchar commented 3 months ago

I think it's pretty straightforward to expose the crawler.run method using FastAPI, for instance. The following snippet is without any guarantees, but it shouldn't be too incorrect :smile:

from typing import Any

from fastapi import FastAPI
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Push whatever you extract into the default dataset so that
    # crawler.get_data() below has something to return.
    await context.push_data({"url": context.request.url, "title": await context.page.title()})

app = FastAPI()

@app.post("/crawl")
async def crawl(url: str) -> Any:
    await crawler.run([url])
    return await crawler.get_data()

Note that this has to be run as a FastAPI app, e.g. with fastapi run.
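Once the app is up, the endpoint can be exercised with a plain POST. A sketch of building that request with the standard library, assuming the default `fastapi run` address (localhost:8000); since the handler declares `url` as a scalar parameter, FastAPI reads it from the query string:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local address where `fastapi run` serves the app by default.
BASE = "http://localhost:8000"

def build_crawl_request(url: str) -> Request:
    # FastAPI maps the scalar `url` handler parameter to a query parameter.
    return Request(f"{BASE}/crawl?{urlencode({'url': url})}", method="POST")

req = build_crawl_request("https://crawlee.dev")
# urllib.request.urlopen(req) would return the dataset items as JSON
# once the server is running.
```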

It is possible that we will implement something even more streamlined in the future, though, so feel free to throw ideas around here.