PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

3.x QA: Web Scraping #13807

Open zangell44 opened 3 months ago

zangell44 commented 3 months ago

Description

Use Prefect to build the following web scraping functionality, noting failures and frictions:

We create data products for hedge funds by scraping e-commerce webpages and enriching the data with Marvin. I want to scrape a page whenever Visualping detects a change to that page. The scrapers are based on the DOM of each webpage. When I scrape a page, I don't know whether the content changed or the DOM itself changed. If the DOM changed, my scraper will fail, but it could also fail because it has been rate limited. I want my scrapers to be as robust to rate limiting as possible. If the DOM changed, I want to try to proceed by having Marvin extract the schema I need from the raw HTML.
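
To make the two failure modes concrete, here is a rough, framework-agnostic sketch of the control flow described above: retry with exponential backoff when rate limited, and fall back to an LLM-style extractor when the selectors no longer match. Everything named here (`RateLimited`, `DomChanged`, `fetch_with_backoff`, `scrape_with_fallback`, and the `fetch`/`parse`/`llm_extract` callables) is hypothetical and not part of Prefect, Marvin, or Visualping; `llm_extract` only stands in for whatever extraction call ends up being used.

```python
import time


class RateLimited(Exception):
    """Hypothetical: raised by fetch() when the site throttles us (e.g. HTTP 429)."""


class DomChanged(Exception):
    """Hypothetical: raised by parse() when the expected selectors are missing."""


def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying only on RateLimited with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the rate-limit failure
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


def scrape_with_fallback(fetch, parse, llm_extract, **retry_kwargs):
    """Return (record, source), where source says which extraction path ran."""
    html = fetch_with_backoff(fetch, **retry_kwargs)
    try:
        return parse(html), "selector"
    except DomChanged:
        # The DOM moved under us: hand the raw HTML to the fallback extractor.
        return llm_extract(html), "llm_fallback"
```

Keeping the two exception types separate is the point of the sketch: a rate limit is worth retrying, while a DOM change will fail the same way on every retry and should go straight to the fallback.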

znicholasbrown commented 3 months ago

Work for this is happening here. Instead of building on Marvin, I'm building on ControlFlow; if Marvin is a hard requirement, I can incorporate that instead. I've begun incorporating retries and CF agents for the web scraping tasks, and I've also added artifacts for the final "product".