apify / crawlee-python

Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
3.78k stars, 260 forks

Add new HTTP client `curl-cffi` #292

Status: Closed (siddiqkaithodu closed this issue 1 month ago)

siddiqkaithodu commented 1 month ago

Since we use HTTPX, wouldn't it be better to use curl_cffi instead, or to offer it as an optional client that takes over when HTTPX gets defeated in the anti-bot war?

I am thinking about intelligently switching between multiple HTTP clients (the same way proxy rotation works).
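
To make the switching idea concrete, here is a rough sketch of what a fallback between clients could look like. It is not Crawlee's API: `FallbackHttpClient`, `Response`, `BLOCKED_STATUS_CODES`, and the two fake fetch functions are hypothetical names made up for illustration, and the list of "blocked" status codes is an assumption. Real clients (HTTPX, curl_cffi) would be wrapped in callables with the same shape.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Status codes treated as "blocked by anti-bot protection" (an assumption).
BLOCKED_STATUS_CODES = frozenset({403, 429, 503})


@dataclass
class Response:
    """Minimal, hypothetical response type used only in this sketch."""
    status_code: int
    text: str
    client_name: str = ""


# A "client" here is any callable that takes a URL and returns a Response.
FetchFn = Callable[[str], Response]


class FallbackHttpClient:
    """Tries each named client in order until one is not blocked."""

    def __init__(self, clients: Sequence[Tuple[str, FetchFn]]) -> None:
        if not clients:
            raise ValueError("at least one client is required")
        self._clients: List[Tuple[str, FetchFn]] = list(clients)

    def fetch(self, url: str) -> Response:
        last: Optional[Response] = None
        for name, fetch_fn in self._clients:
            response = fetch_fn(url)
            response.client_name = name  # record which client answered
            if response.status_code not in BLOCKED_STATUS_CODES:
                return response
            last = response  # blocked: fall through to the next client
        assert last is not None  # guaranteed by the non-empty check above
        return last  # every client was blocked; return the last attempt


# Stand-ins for real clients; no network access is performed here.
def fake_httpx_fetch(url: str) -> Response:
    return Response(status_code=403, text="blocked")


def fake_curl_cffi_fetch(url: str) -> Response:
    return Response(status_code=200, text="<html>ok</html>")


client = FallbackHttpClient(
    [("httpx", fake_httpx_fetch), ("curl_cffi", fake_curl_cffi_fetch)]
)
result = client.fetch("https://example.com")
print(result.client_name, result.status_code)
```

The clients are tried in a fixed order here. A rotation closer to how proxy rotation works would also need per-domain state (which client was blocked where), which this sketch leaves out.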

vdusek commented 1 month ago

Hi, thanks for the issue. I think we can split this one into two parts. First, we can (probably, this needs further investigation) switch from httpx to curl_cffi, which seems to better meet our use case. The second part would be the switching, which should be part of a separate issue/discussion.