apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
14.9k stars, 620 forks

XPATH selectors support #2320

Closed Ehsan-U closed 7 months ago

Ehsan-U commented 7 months ago

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

I've found that XPath selectors are a very powerful tool in web scraping, yet Crawlee currently lacks support for them. XPath is widely used in the Python web scraping community, and Scrapy supports it as well.

Motivation

I recently completed my first project using Crawlee and struggled with writing CSS selectors. I believe this is a common challenge for many developers who come from a Python Scrapy background.

Ideal solution or implementation, and any additional constraints

Not Sure

Alternative solutions or implementations

No response

Other context

No response

vladfrangu commented 7 months ago

Puppeteer and Playwright support XPath selectors (https://playwright.dev/docs/locators#locate-by-css-or-xpath).
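For readers coming from Scrapy, here is a minimal sketch of what XPath extraction through Playwright's locator API might look like. The helper name `textsByXPath` and the `PageLike`/`LocatorLike` types are hypothetical, introduced only for illustration; they mirror the two Playwright calls used (`page.locator` and `locator.allTextContents`) so the helper can be tried without a browser.

```typescript
// Hypothetical structural types covering only the Playwright calls used here,
// so the helper can be exercised without launching a browser.
interface LocatorLike {
  allTextContents(): Promise<string[]>;
}

interface PageLike {
  locator(selector: string): LocatorLike;
}

// Returns the trimmed, non-empty text of every node matched by an XPath expression.
// The explicit "xpath=" prefix tells Playwright to treat the selector as XPath.
async function textsByXPath(page: PageLike, xpath: string): Promise<string[]> {
  const texts = await page.locator(`xpath=${xpath}`).allTextContents();
  return texts.map((t) => t.trim()).filter((t) => t.length > 0);
}
```

Inside a PlaywrightCrawler request handler, the `page` object from the crawling context should be usable here, for example `const titles = await textsByXPath(page, '//h2');`.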

At a quick glance, I don't see cheerio supporting it, but maybe we can potentially look into it for cheerio too

B4nan commented 7 months ago

As Vlad already mentioned, this is supported natively in browsers and you can use that. Crawlee does not "support selectors"; it only abstracts the use of other libraries, like cheerio, that handle such things. I will close this as it does not feel actionable. If you have a library in mind that you'd like us to wrap, that's a different story.

> At a quick glance, I don't see cheerio supporting it, but maybe we can potentially look into it for cheerio too

That feels out of scope, and judging by the cheerio issues, this idea was rejected on their end too.

Ehsan-U commented 7 months ago

Scrapy's parsing support is independent of whether you use playwright, puppeteer, or any other method to fetch the HTML, and a similarly universal parsing method in Crawlee would be a great step. Ultimately, we're dealing with HTML that needs to be parsed, so the method of obtaining it, whether through playwright or puppeteer, shouldn't determine our parsing approach.

Apologies if I came across as nerdy. 🙏

B4nan commented 7 months ago

We already provide a universal parsing method via cheerio: all crawler types have a parseWithCheerio context helper for that.
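A minimal sketch of how that helper makes the parsing step independent of the fetching step. The function `extractHeading` and the `*Like` types are hypothetical names used only for illustration; they describe just the part of the crawling context and the cheerio API that the sketch touches, and the `h1` selector is an arbitrary example.

```typescript
// Hypothetical structural types: only the pieces of the crawling context
// and of the cheerio API that this sketch relies on.
interface SelectionLike {
  text(): string;
}

type CheerioLike = (selector: string) => SelectionLike;

interface ParsingContextLike {
  parseWithCheerio(): Promise<CheerioLike>;
}

// Parsing logic written once against parseWithCheerio, so it does not care
// whether the HTML came from a plain HTTP request or from a rendered browser page.
async function extractHeading(context: ParsingContextLike): Promise<string> {
  const $ = await context.parseWithCheerio();
  return $('h1').text().trim();
}
```

The same function could then be called from the request handler of different crawler types by passing in the crawling context, which is the sense in which the parsing method is universal.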

As I mentioned, if you have a library for XPath support in mind that we could wrap, we can surely do that.
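One option that may not need an extra wrapper is the standard DOM `document.evaluate` XPath API, which jsdom implements and which should therefore be reachable through the `window` object that JSDOMCrawler passes to its request handler. The sketch below assumes that setup. The helper name `xpathTexts` and the `*Like` types are hypothetical, describing only the DOM members the helper uses.

```typescript
// Hypothetical structural types for the small slice of the DOM XPath API used below.
interface SnapshotLike {
  snapshotLength: number;
  snapshotItem(index: number): { textContent: string | null } | null;
}

interface DocumentLike {
  evaluate(
    expression: string,
    contextNode: unknown,
    resolver: null,
    type: number,
    result: null,
  ): SnapshotLike;
}

interface WindowLike {
  document: DocumentLike;
  XPathResult: { ORDERED_NODE_SNAPSHOT_TYPE: number };
}

// Evaluates an XPath expression against the whole document and returns the
// trimmed text content of each matched node, in document order.
function xpathTexts(window: WindowLike, expression: string): string[] {
  const doc = window.document;
  const snapshot = doc.evaluate(
    expression,
    doc,
    null,
    window.XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null,
  );
  const texts: string[] = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    // Nodes without text content are reported as empty strings.
    texts.push((snapshot.snapshotItem(i)?.textContent ?? '').trim());
  }
  return texts;
}
```

In a JSDOMCrawler request handler this might be called as `xpathTexts(window, '//h2')`. Whether jsdom's XPath coverage is sufficient for a given expression is something to verify per use case.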