Closed Ehsan-U closed 7 months ago
Puppeteer and Playwright support XPath selectors (https://playwright.dev/docs/locators#locate-by-css-or-xpath)
At a quick glance, I don't see cheerio supporting it, but maybe we can potentially look into it for cheerio too
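For the Playwright side, a minimal sketch of an XPath locator might look like the following. The `PageLike` and `LocatorLike` interfaces and the `textsByXPath` helper are hypothetical names introduced only for illustration; they mirror the `locator()` and `allTextContents()` methods of Playwright's `Page` and `Locator`, so the snippet compiles without the playwright package installed.

```typescript
// Structural stand-ins for the parts of Playwright's Page and Locator used here
// (assumption: a real Page object is passed in when this runs inside a crawler).
interface LocatorLike {
    allTextContents(): Promise<string[]>;
}
interface PageLike {
    locator(selector: string): LocatorLike;
}

// Hypothetical helper: collects the trimmed text of every element matched by an XPath expression.
async function textsByXPath(page: PageLike, expression: string): Promise<string[]> {
    // The explicit "xpath=" prefix tells Playwright to treat the selector as XPath.
    const texts = await page.locator(`xpath=${expression}`).allTextContents();
    return texts.map((text) => text.trim());
}
```

Inside a Playwright-based request handler, something like `await textsByXPath(page, '//h2')` should then return the heading texts, assuming the page contains `h2` elements.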
As Vlad already mentioned, XPath is supported natively in browsers, and you can use that. Crawlee does not "support selectors"; it only abstracts the use of other libraries, like cheerio, that handle such things. I will close this as it does not feel actionable. If you have a library in mind that you'd like us to wrap, that's a different story.
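As a rough illustration of the native browser route, the sketch below wraps the DOM's `document.evaluate` call. The `XPathCapableDocument` and `SnapshotResult` interfaces and the `xpathTexts` helper are hypothetical; they are shaped after the DOM's `Document.evaluate` and `XPathResult` so the code compiles outside a browser.

```typescript
// Minimal structural types modelled on the DOM's XPathResult and Document.evaluate
// (assumption: in a browser, the real `document` is passed in).
interface SnapshotResult {
    snapshotLength: number;
    snapshotItem(index: number): { textContent: string | null } | null;
}
interface XPathCapableDocument {
    evaluate(
        expression: string,
        contextNode: unknown,
        resolver: null,
        type: number,
        result: null,
    ): SnapshotResult;
}

// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: returns the trimmed text of every node matching the expression.
function xpathTexts(doc: XPathCapableDocument, expression: string): string[] {
    const result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
    const texts: string[] = [];
    for (let i = 0; i < result.snapshotLength; i++) {
        const text = result.snapshotItem(i)?.textContent;
        // Skip nodes that have no text content at all.
        if (text != null) texts.push(text.trim());
    }
    return texts;
}
```

With a browser crawler, the same loop would have to run in the page context, for example inside `page.evaluate(...)`. Functions passed there are serialized, so the helper body would need to be inlined rather than referenced from the outer scope.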
> At a quick glance, I don't see cheerio supporting it, but maybe we can potentially look into it for cheerio too
That feels out of scope, and judging by the cheerio issues, this idea was rejected on their end too.
Scrapy's parsing support is independent of whether you use Playwright, Puppeteer, or any other method to fetch the HTML, and a universal parsing method like that would be a great step for Crawlee. Ultimately, we're dealing with HTML that needs to be parsed, so the way it was obtained, whether through Playwright or Puppeteer, shouldn't determine our parsing approach.
Apologies if I came across as nerdy. 🙏
We already provide a universal parsing method via cheerio: all crawler types have a parseWithCheerio context helper for that.
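To make that concrete, here is a minimal sketch of extraction code that depends only on the parseWithCheerio helper, not on how the HTML was fetched. `ContextLike`, `CheerioRootLike`, `CheerioSelectionLike`, and `extractHeading` are hypothetical names used for illustration; they stand in for the real crawling context and cheerio types so the snippet compiles without either package.

```typescript
// Structural stand-ins for the pieces of a crawling context and the cheerio API
// used below (assumption: the real context from a request handler is passed in).
interface CheerioSelectionLike {
    text(): string;
}
type CheerioRootLike = (selector: string) => CheerioSelectionLike;
interface ContextLike {
    parseWithCheerio(): Promise<CheerioRootLike>;
}

// Hypothetical handler body: the same CSS-selector extraction for any crawler type.
async function extractHeading({ parseWithCheerio }: ContextLike): Promise<string> {
    const $ = await parseWithCheerio();
    return $('h1').text().trim();
}
```

In a real project, a function like this could be called from the request handler of a browser crawler or an HTTP crawler alike, which is the sense in which the parsing step stays universal. It still uses CSS selectors, since cheerio does not evaluate XPath.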
As I mentioned, if you have a library for XPath support in mind that we could wrap, we can certainly do that.
Which package is the feature request for? If unsure which one to select, leave blank
None
Feature
I've found that XPath selectors are a very powerful tool in web scraping, yet Crawlee currently lacks support for them. XPath dominates in the Python web scraping community, and Scrapy also supports it.
Motivation
I recently completed my first project using Crawlee and struggled with writing CSS selectors. I believe this is a common challenge for many developers who come from a Python Scrapy background.
Ideal solution or implementation, and any additional constraints
Not Sure
Alternative solutions or implementations
No response
Other context
No response