apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.52k stars 665 forks

Idea: we could add a function to extract schema.org microdata from a page #276

Open jancurn opened 5 years ago

jancurn commented 5 years ago

It could be called Apify.utils.puppeteer.extractMicrodata and look something like this: https://kb.apify.com/tips-and-tricks/scraping-data-from-websites-using-schemaorg-microdata

Ideally, though, it wouldn't use jQuery.
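
A jQuery-free extractor along these lines could be sketched as follows. This is only a minimal sketch, not the code from the linked article or from any released Apify API: the name extractMicrodata comes from the proposal above, while the `_type` key, the `fakeNode` helper and the sample data are assumptions made for illustration. It relies only on standard DOM members (hasAttribute, getAttribute, children, textContent, tagName) and ignores itemref and itemid.

```javascript
// Sketch: collect top-level microdata items under `root` without jQuery.
// `root` is any DOM-like node exposing children, hasAttribute, getAttribute,
// textContent and tagName. Everything lives inside one function so that it
// has no outer dependencies.
function extractMicrodata(root) {
    const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

    // Store a value; a repeated property name is collected into an array.
    const addProperty = (item, name, value) => {
        if (!hasOwn(item, name)) item[name] = value;
        else if (Array.isArray(item[name])) item[name].push(value);
        else item[name] = [item[name], value];
    };

    // Pick the value of an itemprop element based on its tag (simplified rules).
    const propertyValue = (el) => {
        if (el.hasAttribute('itemscope')) return extractItem(el);
        const tag = (el.tagName || '').toLowerCase();
        if (tag === 'meta') return el.getAttribute('content');
        if (['a', 'area', 'link'].includes(tag)) return el.getAttribute('href');
        if (['img', 'audio', 'video', 'source', 'iframe', 'embed', 'track'].includes(tag)) {
            return el.getAttribute('src');
        }
        if (tag === 'time' && el.hasAttribute('datetime')) return el.getAttribute('datetime');
        if (tag === 'data' || tag === 'meter') return el.getAttribute('value');
        if (tag === 'object') return el.getAttribute('data');
        return (el.textContent || '').trim();
    };

    // Gather the properties of one item, stopping at nested itemscope boundaries.
    const collectProperties = (el, item) => {
        for (const child of el.children) {
            if (child.hasAttribute('itemprop')) {
                const value = propertyValue(child);
                const names = child.getAttribute('itemprop').split(/\s+/).filter(Boolean);
                for (const name of names) addProperty(item, name, value);
            }
            if (!child.hasAttribute('itemscope')) collectProperties(child, item);
        }
    };

    const extractItem = (el) => {
        const item = {};
        // `_type` is a key chosen for this sketch, not a standard name.
        if (el.hasAttribute('itemtype')) item._type = el.getAttribute('itemtype');
        collectProperties(el, item);
        return item;
    };

    // A top-level item has itemscope but is not itself a property of another item.
    const items = [];
    const walk = (el) => {
        for (const child of el.children) {
            if (child.hasAttribute('itemscope') && !child.hasAttribute('itemprop')) {
                items.push(extractItem(child));
            }
            walk(child);
        }
    };
    walk(root);
    return items;
}

// Hypothetical stand-in for a DOM element, so the sketch runs in plain Node.js.
const fakeNode = (tagName, attrs = {}, children = [], textContent = '') => ({
    tagName,
    children,
    textContent,
    hasAttribute: (name) => Object.prototype.hasOwnProperty.call(attrs, name),
    getAttribute: (name) => (Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : null),
});

// Made-up sample data: one Person item with a name and a URL.
const samplePage = fakeNode('body', {}, [
    fakeNode('div', { itemscope: '', itemtype: 'https://schema.org/Person' }, [
        fakeNode('span', { itemprop: 'name' }, [], ' Jane Doe '),
        fakeNode('a', { itemprop: 'url', href: 'https://example.com/jane' }),
    ]),
]);

console.log(JSON.stringify(extractMicrodata(samplePage), null, 2));
```

Because the function is self-contained, in Puppeteer it could probably be run in the page context with something like `page.$eval('html', extractMicrodata)`, which passes the matched element as the first argument. Error handling and itemref support would still be needed for real pages.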

drobnikj commented 5 years ago

There are some npm packages that can handle it, e.g. https://www.npmjs.com/package/microdata-node

mtrunkat commented 5 years ago

@gippy's implementation in https://github.com/apifytech/act-page-analyzer might serve as inspiration.