Open stijn-uva opened 5 years ago
Current plan: a separate tool/package that collects data, stores it as scraped in MongoDB, indexes it with Elasticsearch, and makes it available to 4CAT through a lightweight API that returns full documents for a given Elasticsearch query.
4chan is the odd one out now, being the only one of the many datasources that has its own scraper. This works fine, but it might make more sense to spin the scraper off into its own thing. That would also make it easier to separate the data store from the analytical part of 4CAT, and decoupling them would protect the scraper from crashes originating in the rest of 4CAT.
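
To make the proposed split concrete, here is a rough sketch of the data flow: store the raw post, index it, and answer a query with full documents. It is only an illustration under loud assumptions. `RawStore` and `SearchIndex` are hypothetical in-memory stand-ins for MongoDB and Elasticsearch, `ScraperService` is a made-up name, and the `id` and `body` fields are assumed. None of this is existing 4CAT code, and a real version would use the actual database clients and an HTTP layer.

```python
from dataclasses import dataclass, field


@dataclass
class RawStore:
    """Hypothetical stand-in for MongoDB: keeps posts exactly as scraped."""
    docs: dict = field(default_factory=dict)

    def save(self, doc: dict) -> None:
        self.docs[doc["id"]] = doc

    def get_many(self, ids) -> list:
        return [self.docs[i] for i in ids if i in self.docs]


@dataclass
class SearchIndex:
    """Hypothetical stand-in for Elasticsearch: lowercase token -> document ids."""
    postings: dict = field(default_factory=dict)

    def index(self, doc_id, text: str) -> None:
        for token in set(text.lower().split()):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query: str) -> list:
        tokens = query.lower().split()
        if not tokens:
            return []
        # A document matches only if it contains every query token.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)


class ScraperService:
    """The part that would live outside 4CAT and be reached through the API."""

    def __init__(self) -> None:
        self.store = RawStore()
        self.index = SearchIndex()

    def ingest(self, doc: dict) -> None:
        # Store the raw document first, then make it searchable.
        self.store.save(doc)
        self.index.index(doc["id"], doc.get("body", ""))

    def query(self, query: str) -> list:
        # The index only yields ids; full documents come from the raw store.
        return self.store.get_many(self.index.search(query))
```

In this sketch 4CAT would only ever call something like `query()`, so the analytical side never touches the scraper or the raw store directly, which is the isolation argued for above.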