apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
4.64k stars 319 forks

encoding errors when using the BeautifulSoupCrawlingContext #695

Open Rigos0 opened 1 week ago

Rigos0 commented 1 week ago

When running a crawler using the BeautifulSoupCrawlingContext, I am getting unfixable encoding errors. They are thrown even before the handler function is called.

"encoding error : input conversion failed due to input error, bytes 0xEB 0x85 0x84 0x20"

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # (the same error also occurs when the body is wrapped in `async with Actor:`)
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        url = context.request.url
        print(f"Processing URL: {url}")

The error occurs in about 30% of requests when trying to scrape reviews from Booking.com. Some example links for reproduction:

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25

https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50

I found a related issue stating "Libxml2 does not support the GB2312 encoding, so a way to get around this problem is to convert it to utf-8. I did it and it works for me" (https://github.com/mitmproxy/mitmproxy/issues/657), but I did not manage to fix your BeautifulSoupCrawlingContext code by specifying the encoding.
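The workaround from that issue boils down to re-encoding the payload to UTF-8 before it ever reaches libxml2. A minimal standalone sketch of that idea, using a hypothetical `to_utf8` helper outside of Crawlee:

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode with the declared charset and re-encode as UTF-8, so the
    downstream parser (libxml2) only ever sees UTF-8 input."""
    return raw.decode(source_encoding, errors='replace').encode('utf-8')

# Example with a GB2312 payload, the encoding from the mitmproxy issue:
gb_bytes = '编码'.encode('gb2312')
utf8_bytes = to_utf8(gb_bytes, 'gb2312')
print(utf8_bytes.decode('utf-8'))  # 编码
```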

janbuchar commented 1 week ago

Hello @Rigos0! I tried reproducing this with the following snippet:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from .routes import router

async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_requests_per_crawl=50,
        http_client=CurlImpersonateHttpClient(),
    )

    await crawler.run(
        [
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25',
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
        ]
    )

...and there was no error 🤯 Can you please provide a better reproduction script?

Rigos0 commented 1 week ago

I can currently reproduce the error using this script.

import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    print(f"Processing URL: {url}")

    html_content = context.soup.prettify()
    print(html_content)

async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run([
        'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50',
    ])

if __name__ == '__main__':
    asyncio.run(main())

Note: The scraped page will change over time because we are using the offset parameter; new reviews will push the problematic ones to a higher offset.

If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string "c-review-block". If the site is scraped correctly, you should find 25 review blocks, one for each review.

In my case, I get the error and only 11/25 review blocks (screenshot attached).
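The review-block check can be sketched as a standalone snippet; the markup below is a hypothetical stand-in for the real Booking.com page:

```python
def count_review_blocks(html: str) -> int:
    """Count occurrences of the review-block marker; a fully scraped
    page should yield 25."""
    return html.count('class="c-review-block"')

# Hypothetical sample markup standing in for the real page:
sample = '<div class="c-review-block">...</div>' * 3
print(count_review_blocks(sample))  # 3
```

In the real handler, the same count could be run on the raw HTML or on the parsed `context.soup`.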

janbuchar commented 4 days ago

Note: The scraped page will change over time because we are using the offset parameter; new reviews will push the problematic ones to a higher offset.

If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string "c-review-block". If the site is scraped correctly, you should find 25 review blocks, one for each review.

I tried again, I'm getting 25 review blocks consistently and no errors.

Could you please try to save the HTML of the page that crashes beautifulsoup? We should try to make a reproduction example that doesn't depend on the current state of a website that changes this often.

Rigos0 commented 4 days ago

html_contents.zip

Got an error using offset=200.

The zip contains the original HTML source and the HTML scraped by the script from the previous message. The scraped copy contains only 3/25 review blocks, and some special characters are mangled.

janbuchar commented 4 days ago

This is strange. Could you dump the response headers of a failing request for me?

print(context.http_response.headers)

Rigos0 commented 4 days ago

root={'cache-control': 'private', 'content-encoding': 'br', 'content-length': '10185', 'content-security-policy-report-only': "frame-ancestors 'none'; report-uri https://nellie.booking.com/csp-report-uri?type=report&tag=112&pid=2a8173b6036f02ae&e=UmFuZG9tSVYkc2RlIyh9YeCr9sjcycwx2MIpyyQyTpmqV_3QMFueVZyxbPr4tb7Q", 'content-type': 'text/html; charset=UTF-8', 'date': 'Mon, 18 Nov 2024 16:27:25 GMT', 'nel': '{"max_age":604800,"report_to":"default"}', 'report-to': '{"group":"default","max_age":604800,"endpoints":[{"url":"https://nellie.booking.com/report"}]}', 'server': 'nginx', 'strict-transport-security': 'max-age=63072000; includeSubDomains; preload', 'vary': 'User-Agent, Accept-Encoding', 'via': '1.1 fbd2b51fce9ee4f3aa7b93dbbda3d698.cloudfront.net (CloudFront)', 'x-amz-cf-id': 'KhVqUbybzfGTSTkfDeZIo7vTNT1GvYPI0WPTTsfMMvZio5OpiTlFXw==', 'x-amz-cf-pop': 'FRA56-P8', 'x-cache': 'Miss from cloudfront', 'x-content-type-options': 'nosniff', 'x-recruiting': 'Like HTTP headers? Come write ours: https://careers.booking.com', 'x-xss-protection': '1; mode=block'}

janbuchar commented 4 days ago

Okay, nothing suspicious there. Could you also provide the complete stack trace of your error?

Rigos0 commented 3 days ago

The stack trace is unfortunately just:

encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89
encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89

I tried catching the error/warning with custom handling, but that did not help. This is what Claude had to say about it: "That's the entire error message, repeated twice. The bytes 0xF0 0x9F 0x98 0x89 represent a UTF-8 encoded emoji (the winking face 😉). There's no additional stack trace or context for these specific encoding errors because they're likely being generated at a lower level by the XML/HTML parser without proper Python exception handling."
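The claim about the bytes is easy to verify outside of Crawlee:

```python
# The bytes reported by libxml2 decode cleanly as UTF-8: they are
# U+1F609 (winking face), so the document itself is not malformed UTF-8
# and the failure must come from libxml2's encoding detection.
raw = bytes([0xF0, 0x9F, 0x98, 0x89])
char = raw.decode('utf-8')
print(char, hex(ord(char)))  # 😉 0x1f609
```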

Also, just a note: I wondered whether I had done something wrong with the libraries or the script, but then I remembered I've already reproduced the error both locally and on the Apify platform.

janbuchar commented 3 days ago

Interesting. Could you link me to the run on the Apify platform then?