Spin off scrapers into their own package

digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.


Spin off scrapers into their own package #66

Open stijn-uva opened 5 years ago

stijn-uva commented 5 years ago

4chan is now the odd one out: it is the only one of the many data sources that has its own scraper. This works fine, but it might make more sense to spin the scraper off into its own package. That would also make it easier to separate the data store from the analytical part of 4CAT, and decoupling the two would protect the scraper from crashes that originate in the rest of 4CAT.

stijn-uva commented 2 years ago

Current plan: a separate tool/package that collects data, stores it as scraped in MongoDB, indexes it with ElasticSearch, and makes it available to 4CAT through a lightweight API that returns the full documents matching a given ElasticSearch query.
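
To make the data flow in this plan concrete, here is a rough sketch of the three pieces (store, index, API). It is only an illustration, not an actual design: `DocumentStore` and `SearchIndex` are in-memory stand-ins for MongoDB and ElasticSearch so the example runs without either service, and every name in it (`Collector`, `query_api`, the `id` and `body` fields) is hypothetical.

```python
import re
from collections import defaultdict


class DocumentStore:
    """In-memory stand-in for MongoDB: keeps each item exactly as scraped."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, raw_doc):
        self._docs[doc_id] = raw_doc

    def get_many(self, doc_ids):
        # Return full documents in the order the IDs were given.
        return [self._docs[i] for i in doc_ids if i in self._docs]


class SearchIndex:
    """In-memory stand-in for ElasticSearch: a tiny inverted index."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        return re.findall(r"\w+", text.lower())

    def index(self, doc_id, text):
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query):
        # Match documents that contain every token of the query.
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


class Collector:
    """The scraper side: stores the raw item first, then indexes its text."""

    def __init__(self, store, index):
        self.store = store
        self.index = index

    def ingest(self, raw_doc):
        # "id" and "body" are assumed field names, chosen for this sketch.
        self.store.insert(raw_doc["id"], raw_doc)
        self.index.index(raw_doc["id"], raw_doc.get("body", ""))


def query_api(store, index, query):
    """What the lightweight API would do: resolve a query to full documents."""
    return store.get_many(index.search(query))


store, index = DocumentStore(), SearchIndex()
collector = Collector(store, index)
collector.ingest({"id": 1, "body": "Hello world", "board": "pol"})
collector.ingest({"id": 2, "body": "hello there", "board": "tv"})
matches = query_api(store, index, "hello world")
```

The point of the split is that 4CAT would only ever call something like `query_api`; the collector and both backends could crash, restart, or be replaced without 4CAT's analytical side needing to know.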