Rigos0 opened this issue 1 week ago
Hello @Rigos0! I tried reproducing this with the following snippet:
```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

from .routes import router  # router module from the crawlee project template


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_requests_per_crawl=50,
        http_client=CurlImpersonateHttpClient(),
    )

    await crawler.run(
        [
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25',
            'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50',
        ]
    )
```
...and there was no error :exploding_head: Can you please provide a better reproduction script?
I can currently reproduce the error using this script.
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    print(f"Processing URL: {url}")

    html_content = context.soup.prettify()
    print(html_content)


async def main() -> None:
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run([
        'https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50'
    ])


if __name__ == '__main__':
    asyncio.run(main())
```
Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.
If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string "c-review-block". If the page is scraped correctly, you should find 25 review blocks - one for each review.
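A quick way to check is to count the parsed blocks directly in the handler (a sketch reusing the router from the script above; the `.c-review-block` selector is just what the page currently uses and may change):

```python
@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # A fully scraped page should contain 25 review blocks, one per review,
    # assuming booking.com still marks them with the 'c-review-block' class.
    blocks = context.soup.select('.c-review-block')
    print(f'{context.request.url}: found {len(blocks)}/25 review blocks')
```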
In my case, I get this error and only 11/25 review blocks:
> Note: the scraped page will change over time because we are using the offset parameter - new reviews will push the problematic ones to a higher offset.
> If you are not getting any errors, please verify that you scraped the entire HTML by searching for the string "c-review-block". If the page is scraped correctly, you should find 25 review blocks - one for each review.
I tried again; I'm getting 25 review blocks consistently and no errors.
Could you please try to save the HTML of the page that crashes BeautifulSoup? We should try to make a reproduction example that doesn't depend on the current state of a website that changes this often.
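Something along these lines in the handler should do it (a rough sketch - the file naming is made up, and note that `prettify()` saves what BeautifulSoup managed to parse rather than the raw response bytes):

```python
from pathlib import Path

@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Dump the parsed document to disk so the failing page can be
    # replayed later without depending on the live site.
    offset = context.request.url.split('offset=')[-1]
    Path(f'review_page_offset_{offset}.html').write_text(
        context.soup.prettify(), encoding='utf-8'
    )
```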
Got an error using offset=200.
The zip contains the original HTML source and also the HTML scraped by the script from the previous message. It contains only 3/25 review blocks, and some special characters are garbled.
This is strange. Could you dump the response headers of a failing request for me?
```python
print(context.http_response.headers)
```
```
root={'cache-control': 'private', 'content-encoding': 'br', 'content-length': '10185', 'content-security-policy-report-only': "frame-ancestors 'none'; report-uri https://nellie.booking.com/csp-report-uri?type=report&tag=112&pid=2a8173b6036f02ae&e=UmFuZG9tSVYkc2RlIyh9YeCr9sjcycwx2MIpyyQyTpmqV_3QMFueVZyxbPr4tb7Q", 'content-type': 'text/html; charset=UTF-8', 'date': 'Mon, 18 Nov 2024 16:27:25 GMT', 'nel': '{"max_age":604800,"report_to":"default"}', 'report-to': '{"group":"default","max_age":604800,"endpoints":[{"url":"https://nellie.booking.com/report"}]}', 'server': 'nginx', 'strict-transport-security': 'max-age=63072000; includeSubDomains; preload', 'vary': 'User-Agent, Accept-Encoding', 'via': '1.1 fbd2b51fce9ee4f3aa7b93dbbda3d698.cloudfront.net (CloudFront)', 'x-amz-cf-id': 'KhVqUbybzfGTSTkfDeZIo7vTNT1GvYPI0WPTTsfMMvZio5OpiTlFXw==', 'x-amz-cf-pop': 'FRA56-P8', 'x-cache': 'Miss from cloudfront', 'x-content-type-options': 'nosniff', 'x-recruiting': 'Like HTTP headers? Come write ours: https://careers.booking.com', 'x-xss-protection': '1; mode=block'}
```
Okay, nothing suspicious there. Could you also provide the complete stack trace of your error?
The stack trace is unfortunately just

```
encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89
encoding error : input conversion failed due to input error, bytes 0xF0 0x9F 0x98 0x89
```
I tried catching the error/warning with custom handling but that did not help. This is what Claude had to say about it:
> That's the entire error message, repeated twice. The bytes 0xF0 0x9F 0x98 0x89 represent a UTF-8 encoded emoji (the winking face 😉). There's no additional stack trace or context for these specific encoding errors because they're likely being generated at a lower level by the XML/HTML parser without proper Python exception handling.
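For what it's worth, those bytes really are a valid UTF-8 sequence - Python decodes them without complaint:

```python
>>> b'\xf0\x9f\x98\x89'.decode('utf-8')
'😉'
```

So the input libxml2 is rejecting is well-formed UTF-8; the failure seems to happen inside the parser itself rather than in Python's decoding.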
Also, just a note: I was wondering whether I had done something wrong with the libraries or the script, but I have already replicated the error both locally and on the Apify platform.
Interesting. Could you link me to the run on the Apify platform then?
When running a crawler using the BeautifulSoupCrawlingContext, I am getting unfixable encoding errors. They are thrown even before the handler function is called.
"encoding error : input conversion failed due to input error, bytes 0xEB 0x85 0x84 0x20"
The error occurs in about 30% of requests when trying to scrape reviews from Booking.com. Some example links for replication:
https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=25
https://www.booking.com/reviewlist.en-gb.html?cc1=cz&pagename=hotel-don-giovanni-prague&rows=25&sort=f_recent_desc&offset=50
I found a relevant issue stating "Libxml2 does not support the GB2312 encoding so a way to get around this problem is to convert it to utf-8. I did it and it works for me:" (https://github.com/mitmproxy/mitmproxy/issues/657), but I did not manage to fix your BeautifulSoupCrawlingContext code by specifying the encoding.
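Outside of crawlee, that decode-first workaround does look feasible - a minimal sketch using plain httpx + BeautifulSoup (not crawlee's actual pipeline, and booking.com may of course answer a plain client differently):

```python
import httpx
from bs4 import BeautifulSoup

url = (
    'https://www.booking.com/reviewlist.en-gb.html'
    '?cc1=cz&pagename=hotel-don-giovanni-prague'
    '&rows=25&sort=f_recent_desc&offset=50'
)

response = httpx.get(url)
# Decode the raw bytes in Python first, replacing anything undecodable,
# so lxml receives a clean str instead of guessing at the byte stream.
text = response.content.decode('utf-8', errors='replace')
soup = BeautifulSoup(text, 'lxml')
print(f'found {len(soup.select(".c-review-block"))}/25 review blocks')
```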