apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

HTTP API for Spider #295

Closed Ehsan-U closed 1 month ago

Ehsan-U commented 3 months ago

Scrapy has a third-party companion library, ScrapyRT, which exposes spiders through an HTTP API. By sending a request to ScrapyRT with a spider name and a URL, you receive the items the spider collected from that URL.

It would be great if Crawlee could provide similar functionality out of the box. It's like turning a website into a real-time API.
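For context, the ScrapyRT workflow described above boils down to a single GET request. A minimal sketch of building such a request with the standard library, assuming ScrapyRT's default port (9080), its documented crawl.json endpoint, and a hypothetical spider named "quotes":

```python
from urllib.parse import urlencode

# Assumed ScrapyRT endpoint; 9080 is its documented default port.
SCRAPYRT_BASE = "http://localhost:9080/crawl.json"

def build_scrapyrt_url(spider_name: str, url: str) -> str:
    """Build the GET URL that ScrapyRT expects for a one-off crawl."""
    return f"{SCRAPYRT_BASE}?{urlencode({'spider_name': spider_name, 'url': url})}"

# Fetching this URL (once ScrapyRT is running) returns a JSON body whose
# "items" key holds the data the spider scraped from the given URL.
request_url = build_scrapyrt_url("quotes", "http://quotes.toscrape.com/")
```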

janbuchar commented 3 months ago

I think it's pretty straightforward to expose the crawler.run method using FastAPI, for instance. The following snippet is without any guarantees, but it shouldn't be too incorrect :smile:

from typing import Any

from fastapi import FastAPI
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Push whatever you extract into the default dataset so that
    # crawler.get_data() below has something to return.
    await context.push_data({"url": context.request.url, "title": await context.page.title()})

app = FastAPI()

@app.post("/crawl")
async def crawl(url: str) -> Any:
    await crawler.run([url])
    return await crawler.get_data()

Note that this has to be run as a FastAPI app, e.g. with fastapi run.
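Once the app is up, the endpoint can be exercised with a plain POST. A sketch of building that request with the standard library, assuming the default `fastapi run` address (localhost:8000); since the handler declares `url` as a scalar parameter, FastAPI reads it from the query string:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local address where `fastapi run` serves the app by default.
BASE = "http://localhost:8000"

def build_crawl_request(url: str) -> Request:
    # FastAPI maps the scalar `url` handler parameter to a query parameter.
    return Request(f"{BASE}/crawl?{urlencode({'url': url})}", method="POST")

req = build_crawl_request("https://crawlee.dev")
# urllib.request.urlopen(req) would return the dataset items as JSON
# once the server is running.
```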

It is possible that we will implement something even more streamlined in the future, though, so feel free to throw ideas around here.