apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Update main examples to include DOM manipulation #454

Closed mtrunkat closed 2 years ago

mtrunkat commented 5 years ago

The main examples on the Apify SDK website, in the GitHub repo, and in the CLI templates should demonstrate how to manipulate the DOM and retrieve data from it.

Also add one example of scraping with Apify SDK + jQuery to https://sdk.apify.com/docs/examples/basiccrawler

Feedback from: https://medium.com/better-programming/do-i-need-python-scrapy-to-build-a-web-scraper-7cc7cac2081d

I lost an hour trying to make a simple page parsed with Apify SDK, trying to understand how to access the DOM and selectors. If you want a great crawler this might work for you but you need to understand its particular logic and I didn’t have time for it.
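
For illustration, here is a rough sketch of the kind of DOM access such an example might show. It is not taken from the Apify SDK docs. It assumes the `cheerio` package is installed and works on an inline HTML string instead of a real crawl; `PageData`, `extractData`, and `sampleHtml` are hypothetical names made up for this sketch. In a real crawler, the HTML would come from the response handled for each request.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical result shape for this sketch; not part of any Apify SDK API.
interface PageData {
    title: string;
    links: string[];
    text: string;
}

// Parses raw HTML, then uses jQuery-style selectors to change and read the DOM.
function extractData(html: string): PageData {
    const $ = cheerio.load(html);

    // Manipulate the DOM: remove elements that would pollute the extracted text.
    $('.ad').remove();

    // Retrieve data with CSS selectors.
    const title = $('h1').first().text().trim();
    const links = $('a[href]')
        .toArray()
        .map((el) => $(el).attr('href') ?? '');
    const text = $('p').text().trim();

    return { title, links, text };
}

// Made-up sample page, used only so the sketch runs without network access.
const sampleHtml = `
<html>
  <body>
    <h1> Example Domain </h1>
    <p>First paragraph.<span class="ad"> Buy now!</span></p>
    <a href="https://example.com/a">A</a>
    <a href="/b">B</a>
  </body>
</html>`;

console.log(extractData(sampleHtml));
```

Keeping the selector logic in a plain function like this makes it easy to try against a saved HTML snippet before wiring it into a crawler.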

jancurn commented 5 years ago

This is related to https://github.com/apifytech/apify-cli/issues/23. Basically we should have a single shared list of well-prepared project templates, and use them in CLI, app and Apify SDK examples.

B4nan commented 2 years ago

I guess we can also close this one