apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Reconsider crawler inheritance #350

Open · janbuchar opened 1 month ago

janbuchar commented 1 month ago

Currently, we have the following inheritance chains:

This is an intentional difference from the JS version, where

We might want to reconsider this because

The possible ways out are

  1. Leave it as it is now
  2. Parametrize HttpCrawler with an HTML parser
    • this would make BeautifulSoupCrawler and ParselCrawler very thin - they would just pass the right HttpClient and HtmlParser to HttpCrawler
    • we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext
  3. Remove HttpCrawler altogether and pull its functionality into BasicCrawler
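For illustration, option 2 could be sketched roughly as follows. All names here (`HtmlParser`, `HttpCrawler`, `parse_html`) are hypothetical stand-ins, not Crawlee's actual API:

```python
from typing import Generic, Protocol, TypeVar

TParseResult = TypeVar("TParseResult")


class HtmlParser(Protocol[TParseResult]):
    """Anything that can turn raw HTML into some parser-specific result."""

    def parse(self, html: str) -> TParseResult: ...


class HttpCrawler(Generic[TParseResult]):
    """Crawler parametrized by an HTML parser; subclasses would just pick one."""

    def __init__(self, parser: HtmlParser[TParseResult]) -> None:
        self._parser = parser

    def parse_html(self, html: str) -> TParseResult:
        # The return type is dictated by whichever parser was plugged in.
        return self._parser.parse(html)
```

Under this scheme, `BeautifulSoupCrawler` and `ParselCrawler` would shrink to thin subclasses that supply the right parser (and HTTP client) to the generic base.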
B4nan commented 1 month ago

> For https://github.com/apify/crawlee-python/issues/249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just beautifulsoup, for instance

I wanted to have something like parseWithCheerio here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?

> Parametrize HttpCrawler with an HTML parser

I like this one. But I would say we want to have some sane default, be it beautifulsoup or anything else. I guess it depends on how heavy that dependency is, since this default would always have to be installed; that's how we do it in the JS version, where cheerio is not an optional dependency.

> we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext

Similarly to the default parser, we should have a default request client, and if we have one, it feels ok to keep that helper in the basic crawler.

janbuchar commented 1 month ago

> > For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just beautifulsoup, for instance
>
> I wanted to have something like parseWithCheerio here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?

Yeah, the crawler class would have to be generic over the return type of the HTML parser, if that makes any sense. And it wouldn't provide the same "pluggability" that we intend to have for HTTP clients... which is probably fine.
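To sketch what "generic over the return type of the HTML parser" could mean in practice (all names below are illustrative, not the actual API): two parsers with different result types yield crawlers whose "parse the current HTML" helper returns different things, which is exactly the portability concern.

```python
from typing import Callable, Generic, TypeVar

TResult = TypeVar("TResult")


class GenericCrawler(Generic[TResult]):
    """Hypothetical crawler whose type parameter is the parser's result type."""

    def __init__(self, parse: Callable[[str], TResult]) -> None:
        self._parse = parse

    def parse_html(self, html: str) -> TResult:
        return self._parse(html)


# Two stand-in parsers with different result types:
def soup_like_parser(html: str) -> dict:
    return {"html": html}  # stand-in for a BeautifulSoup object


def selector_like_parser(html: str) -> list:
    return html.split()  # stand-in for a parsel SelectorList


soup_crawler: "GenericCrawler[dict]" = GenericCrawler(soup_like_parser)
parsel_crawler: "GenericCrawler[list]" = GenericCrawler(selector_like_parser)
# Handler code written against one return type is not portable to the other.
```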

> > Parametrize HttpCrawler with an HTML parser
>
> I like this one. But I would say we want to have some sane default, be it beautifulsoup or anything else. I guess it depends on how heavy that dependency is, since this default would always have to be installed; that's how we do it in the JS version, where cheerio is not an optional dependency.

I imagine we could either keep BeautifulSoupCrawler (though it would not contain much logic) and have HttpCrawler ship without an HTML parser by default (it could throw an error when somebody attempts to parse HTML), or we could make some parser the default and check for its dependencies on instantiation (we already do a similar thing when importing BeautifulSoupCrawler and PlaywrightCrawler).
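The "check for dependencies on instantiation" idea could be sketched like this; the class name and error message are made up for illustration, but the mechanism (`importlib.util.find_spec`) is standard library:

```python
import importlib.util


class ParserWithOptionalDependency:
    """Fails fast with a helpful error if the backing library is missing."""

    def __init__(self, module_name: str) -> None:
        # find_spec returns None for an installable-but-absent top-level module,
        # so merely importing this class never requires the optional dependency.
        if importlib.util.find_spec(module_name) is None:
            raise ImportError(
                f"The '{module_name}' package is required for this parser; "
                f"please install it first."
            )
        self.module_name = module_name
```

A default parser could then be constructed lazily, so importing the crawler itself stays dependency-free and the error only surfaces when the parser is actually used.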

> > we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext
>
> Similarly to the default parser, we should have a default request client, and if we have one, it feels ok to keep that helper in the basic crawler.

Yeah, currently HttpxClient is the default.

janbuchar commented 1 week ago

I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145

vdusek commented 1 week ago

> I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145

It looks great!

I'm even thinking about whether specific subclasses like BeautifulSoupCrawler / ParselCrawler might be unnecessary when the HttpCrawler class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).

```python
class BeautifulSoupCrawler(
    HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
):
    pass
```

It would be a big breaking change of course...

Also, it seems that in your PoC I can do the following:

```python
parsel_crawler = ParselCrawler(
    http_client=httpx_client,
    parser=BeautifulSoupStaticContentParser(),
)
```

(Having an instance of ParselCrawler with the BeautifulSoup parser.)

We probably wouldn't want to expose the parser on the BS/Parsel level.
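One way to avoid exposing the parser at the BS/Parsel level (hypothetical names below, sketching the idea rather than the actual PoC) is for the concrete subclass to fix its parser internally and accept only the HTTP client:

```python
class HttpCrawler:
    """Base crawler; the parser is chosen by subclasses, not by end users."""

    def __init__(self, *, http_client: object, parser: object) -> None:
        self.http_client = http_client
        self.parser = parser


class ParselParser:
    def parse(self, html: str) -> str:
        return html  # stand-in for building a parsel.Selector


class ParselCrawler(HttpCrawler):
    def __init__(self, *, http_client: object) -> None:
        # No `parser` keyword here: the parser is an implementation detail,
        # so a BeautifulSoup parser cannot be passed to a ParselCrawler.
        super().__init__(http_client=http_client, parser=ParselParser())
```

Passing `parser=` to `ParselCrawler` then fails with a `TypeError` at construction time rather than producing a mismatched crawler.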

janbuchar commented 1 week ago

> I'm even thinking about whether specific subclasses like BeautifulSoupCrawler / ParselCrawler might be unnecessary when the HttpCrawler class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).
>
> ```python
> class BeautifulSoupCrawler(
>     HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
> ):
>     pass
> ```

Well, you could just use HttpCrawler, true, but it wouldn't be as user-friendly; manually writing out type parameters gets tedious fast. I'd probably keep those classes just for convenience.

> Also, it seems that in your PoC I can do the following:
>
> ```python
> parsel_crawler = ParselCrawler(
>     http_client=httpx_client,
>     parser=BeautifulSoupStaticContentParser(),
> )
> ```
>
> (Having an instance of ParselCrawler with the BeautifulSoup parser.)
>
> We probably wouldn't want to expose the parser on the BS/Parsel level.

True.