janbuchar opened this issue 4 months ago
> For https://github.com/apify/crawlee-python/issues/249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance

I wanted to have something like `parseWithCheerio` here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?
> Parametrize HttpCrawler with an HTML parser

I like this one. But I would say we want some sane default, be it BeautifulSoup or anything else. I guess it depends on how heavy that dependency is, since this default should always be installed - that's how we do it in the JS version, where cheerio is not an optional dependency.
> we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`

Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to keep that helper in the basic crawler.
>> For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance
>
> I wanted to have something like `parseWithCheerio` here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?

Yeah, the crawler class would have to be generic over the return type of the HTML parser, if that makes any sense. And it wouldn't provide the same "pluggability" that we intend to have for HTTP clients... which is probably fine.
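A minimal sketch of that generic parametrization, using hypothetical names (`HtmlParser`, `parse_html`) that stand in for whatever the real API would be:

```python
from typing import Generic, TypeVar

TParseResult = TypeVar("TParseResult")


class HtmlParser(Generic[TParseResult]):
    """Hypothetical parser abstraction; the type parameter is its return type."""

    def parse(self, html: str) -> TParseResult:
        raise NotImplementedError


class HttpCrawler(Generic[TParseResult]):
    """The crawler inherits the parser's result type, so handlers stay typed."""

    def __init__(self, parser: HtmlParser[TParseResult]) -> None:
        self._parser = parser

    def parse_html(self, html: str) -> TParseResult:
        return self._parser.parse(html)
```

A BeautifulSoup-backed parser would make the crawler an `HttpCrawler[BeautifulSoup]`, which is exactly the portability concern above: code written against one parser's result type won't type-check against another's.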
>> Parametrize HttpCrawler with an HTML parser
>
> I like this one. But I would say we want some sane default, be it BeautifulSoup or anything else. I guess it depends on how heavy that dependency is, since this default should always be installed - that's how we do it in the JS version, where cheerio is not an optional dependency.

I imagine we could either keep `BeautifulSoupCrawler` (but it would not contain much logic) and have `HttpCrawler` ship with no HTML parser by default (it could throw an error when somebody attempts to parse HTML), or we could make some parser the default and check for its dependencies on instantiation (we already do a similar thing when importing `BeautifulSoupCrawler` and `PlaywrightCrawler`).
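The instantiation-time dependency check could look roughly like this (the helper name and error message are made up, and crawlee's actual check may differ):

```python
import importlib


def require_dependency(module: str, extra: str) -> None:
    """Raise a helpful error when an optional parser dependency is missing."""
    try:
        importlib.import_module(module)
    except ModuleNotFoundError as exc:
        raise ImportError(
            f"Missing optional dependency {module!r}; "
            f"install the package with the {extra!r} extra to use this parser."
        ) from exc


class DefaultHtmlCrawler:
    """Hypothetical crawler that defaults to a BeautifulSoup-backed parser."""

    def __init__(self) -> None:
        # fail fast at instantiation rather than on the first parse attempt
        require_dependency("bs4", "beautifulsoup")
```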
>> we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
>
> Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to keep that helper in the basic crawler.

Yeah, currently `HttpxClient` is the default.
I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145
> I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145

It looks great!
I'm even thinking about whether specific subclasses like `BeautifulSoupCrawler` / `ParselCrawler` might be unnecessary when the `HttpCrawler` class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).

```python
class BeautifulSoupCrawler(
    HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
):
    pass
```
It would be a big breaking change, of course...
Also, it seems in your PoC I can do the following:

```python
parsel_crawler = ParselCrawler(
    http_client=httpx_client,
    parser=BeautifulSoupStaticContentParser(),
)
```

(Having an instance of `ParselCrawler` with the BeautifulSoup parser.) We probably wouldn't want to expose the `parser` on the BS/Parsel level.
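One way to avoid that, sketched with simplified stand-ins for the PoC classes: the convenience subclass pins its parser and simply doesn't accept a `parser` argument:

```python
class StaticContentParser:
    """Simplified stand-in for the parser abstraction in the PoC."""

    def parse(self, html: str):
        raise NotImplementedError


class ParselStaticContentParser(StaticContentParser):
    def parse(self, html: str):
        return ("parsel", html)  # placeholder for a real parsel Selector


class HttpCrawler:
    """Generic layer: accepts any parser/client combination."""

    def __init__(self, *, parser: StaticContentParser, http_client=None):
        self._parser = parser
        self._http_client = http_client


class ParselCrawler(HttpCrawler):
    """Convenience layer: the parser is fixed and not exposed."""

    def __init__(self, *, http_client=None):
        super().__init__(parser=ParselStaticContentParser(), http_client=http_client)
```

Passing `parser=BeautifulSoupStaticContentParser()` to this `ParselCrawler` then fails with a `TypeError` instead of producing a silently inconsistent crawler.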
> I'm even thinking about whether specific subclasses like `BeautifulSoupCrawler` / `ParselCrawler` might be unnecessary when the `HttpCrawler` class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).
>
> ```python
> class BeautifulSoupCrawler(
>     HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
> ):
>     pass
> ```

Well, you could just use `HttpCrawler`, true, but it wouldn't be as user-friendly - manually writing out type parameters gets tedious fast. I'd probably keep those classes just for convenience.
> Also, it seems in your PoC I can do the following:
>
> ```python
> parsel_crawler = ParselCrawler(
>     http_client=httpx_client,
>     parser=BeautifulSoupStaticContentParser(),
> )
> ```
>
> (Having an instance of `ParselCrawler` with the BeautifulSoup parser.) We probably wouldn't want to expose the `parser` on the BS/Parsel level.

True.
Currently, we have the following inheritance chains:

- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)

This is an intentional difference from the JS version, where

- `BrowserCrawler` is a common ancestor of `PlaywrightCrawler` and `PuppeteerCrawler`
- `CheerioCrawler` and `JSDomCrawler` inherit from `HttpCrawler`

We might want to reconsider this. The possible ways out are:

- parametrize `HttpCrawler` with an HTML parser and make `BeautifulSoupCrawler` and `ParselCrawler` very thin - they would just pass the right `HttpClient` and `HtmlParser` to `HttpCrawler`
- move the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
- drop `HttpCrawler` altogether and pull its functionality into `BasicCrawler`
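The option of moving `send_request` from `BasicCrawlingContext` down to `HttpCrawlingContext` could be sketched like this (class shapes are simplified stand-ins, not the real crawlee types):

```python
from dataclasses import dataclass


class HttpClient:
    """Stand-in for the pluggable HTTP client abstraction."""

    def send(self, url: str) -> str:
        raise NotImplementedError


@dataclass
class BasicCrawlingContext:
    # no send_request here: BasicCrawler makes no promise about HTTP
    url: str


@dataclass
class HttpCrawlingContext(BasicCrawlingContext):
    # only HTTP-based crawlers are guaranteed to have a client configured,
    # so the helper lives on their context
    http_client: HttpClient

    def send_request(self, url: str) -> str:
        return self.http_client.send(url)
```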