apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
3.79k stars, 260 forks

add support for Parsel #335

Closed Ehsan-U closed 1 month ago

Ehsan-U commented 1 month ago

BeautifulSoup lacks proper type hints: most of its API is typed as Any, so IDE autocompletion is not effective. A solid alternative is Parsel. It supports CSS selectors and XPath expressions for HTML and XML, JMESPath for JSON documents, and regular expressions. Additionally, Parsel is the parser used by Scrapy.

siddiqkaithodu commented 1 month ago

I was thinking about selectolax.

Ehsan-U commented 1 month ago

selectolax supports neither XPath selectors nor JMESPath for JSON.

janbuchar commented 1 month ago

We started out with BeautifulSoup because of its popularity, but you're right that it has its shortcomings. Adding support for either selectolax or parsel as a new crawler type should be fairly easy - we'll consider it.

asymness commented 1 month ago

+1 for Parsel

asymness commented 1 month ago

@janbuchar, I'd like to help out by adding Parsel support as a new crawler type. Would you be open to a PR from me for this?

janbuchar commented 1 month ago

> @janbuchar, I'd like to help out by adding Parsel support as a new crawler type. Would you be open to a PR from me for this?

Absolutely 🙂