apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.87k stars 685 forks

Add more hooks / events for crawler lifecycle #1859

Open prenaissance opened 1 year ago

prenaissance commented 1 year ago

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/core

Feature

Add ways to subscribe to crawler lifecycle events. Exporting the scraped data is probably one of the most common actions after a crawler run, and an idiomatic way to handle it would be great. With this addition, crawlers could be more self-contained, which would benefit projects with multiple crawlers.

Motivation

Lifecycle hooks would let each crawler use its own strategy for exporting data. For example, one crawler could send e-commerce products as CSV to a data lake, while another sends e-commerce company information to a database. Another use case is resiliency: add a handler that sends a "starting" message to temporary storage and another that sends a "finished" message. If the crawler crashes before the "finished" message is sent, a retry strategy can be used.

Ideal solution or implementation, and any additional constraints

Add the handlers to the constructor options:

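const crawler = new JSDOMCrawler({
  // Proposed options (not part of the current API):
  handleStartCrawl: function1, // would run before the crawl starts
  handleEndCrawl: function2,   // would run after the crawl ends
  // ...
});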

Or add events to the crawler:

crawler.on("start", function1);

Alternative solutions or implementations

An alternative would be to create a wrapper composite class or type that holds the crawler together with its lifecycle event handlers. Another would be to switch to a monorepo and make an app for each crawler.
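For illustration, the wrapper approach could look roughly like the sketch below. It is only a sketch: `CrawlerWithLifecycle`, the event names `start`, `end`, and `failed`, and the `fakeCrawler` stand-in are all hypothetical and not part of Crawlee. The only assumption about the wrapped object is that it exposes an async `run()` method, as Crawlee crawlers do.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical wrapper (not a Crawlee API): pairs any object that has an
// async run() method with "start", "end", and "failed" lifecycle events.
class CrawlerWithLifecycle extends EventEmitter {
  constructor(crawler) {
    super();
    this.crawler = crawler;
  }

  async run(...args) {
    this.emit('start');
    try {
      const result = await this.crawler.run(...args);
      this.emit('end', result);
      return result;
    } catch (err) {
      // A custom event name is used on purpose: emitting "error" on an
      // EventEmitter with no "error" listener would throw.
      this.emit('failed', err);
      throw err;
    }
  }
}

// Runs the wrapper against a stand-in crawler so the sketch works
// without Crawlee installed, and returns the events that were observed.
async function demo() {
  const fakeCrawler = {
    run: async (urls) => ({ requestsFinished: urls.length }),
  };
  const wrapped = new CrawlerWithLifecycle(fakeCrawler);
  const seen = [];
  wrapped.on('start', () => seen.push('start'));
  wrapped.on('end', (stats) => seen.push(`end:${stats.requestsFinished}`));
  await wrapped.run(['https://example.com']);
  return seen;
}

demo().then((events) => console.log(events));
```

Note that `emit()` calls listeners synchronously and does not wait for promises they return, so an async export step attached to `end` would need to be awaited by other means.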

Other context

No response

Eleskovic commented 10 months ago

Are there any workarounds currently? I want to skip saving anything into storage, and I'd like to implement my own storage mechanism with this. Is it possible?

prenaissance commented 10 months ago

> Are there any workarounds currently? I want to skip saving anything into storage, and I'd like to implement my own storage mechanism with this. Is it possible?

Setting the persistStorage option to false did the trick for me, and there is a corresponding environment variable for it too.
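A minimal sketch of that setting, assuming the `crawlee` package is installed and that the option is passed through a `Configuration` instance given as the crawler's second constructor argument. The request handler body is a hypothetical placeholder for a custom storage mechanism. To my knowledge the matching environment variable is `CRAWLEE_PERSIST_STORAGE`; check the Crawlee configuration docs before relying on it.

```javascript
const { CheerioCrawler, Configuration } = require('crawlee');

// With persistStorage set to false, the default storages are kept
// in memory rather than being written to disk.
const config = new Configuration({ persistStorage: false });

const crawler = new CheerioCrawler({
  async requestHandler({ request, $ }) {
    // Hypothetical sink: replace this with your own storage mechanism.
    console.log(request.url, $('title').text());
  },
}, config);

crawler.run(['https://crawlee.dev']);
```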