Item type: data pipeline

Description: Run a web scraper on Google Cloud Platform (GCP) to capture and store text data scraped from websites. The goal is to gather and persist the scraped text efficiently on Google Cloud, with the scraper deployed in a Docker container. The scraped text files will be stored on Google Cloud and used for prompt engineering within the LangChain framework to support AI agent response generation.

User Story

As a developer,
I want to run the scraping functionality on Google Cloud Platform
so that I can collect text data from various sources and store it in Google Cloud, where it will be used for prompt engineering in the LangChain framework to generate responses for AI agents.

Acceptance Criteria

[ ] Deploy the scraping functionality on Google Cloud Platform to perform automated data extraction from multiple sources.
[ ] Ensure the scraper handles large-scale scraping operations reliably and efficiently.
[ ] Configure the scraper to save scraped text files directly to Google Cloud Storage (GCS).
[ ] Implement batch processing or streaming capabilities as needed to manage large volumes of data.
[ ] Implement error handling and retry mechanisms to ensure robust performance and data integrity.
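
The storage, batching, and retry criteria above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual implementation: the helper names `with_retry`, `upload_text`, and `save_batch` are hypothetical, the backoff parameters are arbitrary, and `upload_text` assumes the `google-cloud-storage` package is installed and Application Default Credentials are available in the container.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def with_retry(action: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action(), retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")


def upload_text(bucket_name: str, blob_name: str, text: str) -> None:
    """Upload one scraped document to GCS (assumes google-cloud-storage)."""
    # Imported lazily so the helpers above work without the dependency.
    from google.cloud import storage

    client = storage.Client()  # picks up Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(text, content_type="text/plain")


def save_batch(docs: Iterable[tuple[str, str]],
               upload: Callable[[str, str], None],
               attempts: int = 3,
               sleep: Callable[[float], None] = time.sleep) -> list[str]:
    """Upload (name, text) pairs one by one; return names that still failed."""
    failed = []
    for name, text in docs:
        try:
            with_retry(lambda: upload(name, text), attempts=attempts, sleep=sleep)
        except Exception:
            failed.append(name)  # keep going so one bad file does not stop the batch
    return failed
```

A call might look like `save_batch(docs, lambda name, text: upload_text("my-bucket", name, text))`, where `"my-bucket"` is a placeholder bucket name. Returning the list of failed names lets the caller log or re-queue them instead of aborting the whole run.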

Definition of Done

[ ] The feature has been fully implemented.
[ ] The feature has been manually tested and works as expected without critical bugs.
[ ] The feature code is documented with clear explanations of its functionality and usage.
[ ] The feature code has been reviewed and approved by at least one team member.
[ ] The feature branches have been merged into the main branch and closed.
[ ] The feature's utility, function, and usage have been documented in the respective project wiki on GitHub.

amosproj / amos2024ss06-health-ai-framework

Run scrapper on Google Cloud #238
