Currently, data is collected and combined in the pulldata branch, and the pipeline is still run manually at regular intervals or on demand. It may be run via an event-triggered CI/CD action for a while during development, but a more proper workflow (and tech stack?) and proper storage are needed — not a dataset of Parquet files, since its size keeps growing.
Task:
[x] Database / PostgreSQL / Supabase
[ ] Temporary storage (data acquisition)
    [x] 1. HTML > gzip > MongoDB (blob)
    [ ] 2. HTML > gzip > Supabase storage
[ ] Parsing HTML
    [x] Direct structured data from scraper
[ ] Triggered pipeline from temp storage to database
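The temporary-storage step above (HTML > gzip > blob) can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `compress_html`/`decompress_html` helpers and the commented-out MongoDB collection name are hypothetical, assuming raw HTML pages are gzip-compressed before being written as binary blobs.

```python
import gzip


def compress_html(html: str) -> bytes:
    """Gzip-compress raw HTML so it can be stored as a binary blob."""
    return gzip.compress(html.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Restore the original HTML from a stored blob (e.g. for parsing)."""
    return gzip.decompress(blob).decode("utf-8")


# With pymongo, storing the blob might look like this
# (collection and field names are hypothetical):
#   db.pages.insert_one({"url": url, "html_gz": compress_html(html)})

if __name__ == "__main__":
    page = "<html><body>example</body></html>"
    blob = compress_html(page)
    assert decompress_html(blob) == page
```

The same compressed blob could equally be uploaded to Supabase storage (option 2 above); only the storage backend changes, the gzip round-trip stays the same.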