SocialChangeLab / media-impact-monitor

The Media Impact Monitor will be a novel tool for protest groups and NGOs to measure and visualize their impact on public discourse.
https://mediaimpactmonitor.app

[ORG] Discuss data pipeline / fetching / API #26

Closed: kleinlennart closed this issue 4 months ago

davidpomerenke commented 7 months ago

Some inconclusive thoughts about the necessity of a database:

My initial instinct is to avoid a database, since it introduces unnecessary complexity, and I would prefer simple joblib-based caching for storing count data from external APIs.
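To make the caching idea concrete, here is a minimal sketch using `joblib.Memory`. The function `fetch_daily_counts`, its parameters, and the returned numbers are hypothetical placeholders and not part of any real API; the cache directory is a temporary folder chosen only for illustration.

```python
import tempfile

from joblib import Memory

# Assumption: a throwaway temp directory; a real setup would use a persistent path.
memory = Memory(tempfile.mkdtemp(), verbose=0)

# Records every real (uncached) fetch so the caching effect is visible.
api_calls = []


@memory.cache
def fetch_daily_counts(query: str, start: str, end: str) -> dict:
    """Hypothetical stand-in for a request to an external count API."""
    api_calls.append((query, start, end))
    # Made-up placeholder data; a real implementation would call the API here.
    return {"query": query, "start": start, "end": end, "counts": [3, 5, 2]}


# The first call runs the function body; the second is served from the disk cache.
first = fetch_daily_counts("climate protest", "2024-01-01", "2024-01-03")
second = fetch_daily_counts("climate protest", "2024-01-01", "2024-01-03")
```

Results are keyed by the function's arguments, so this fits count data well: each (query, date range) pair is fetched once and then read from disk.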

But since we also want to work with full texts, a database might make a lot of sense for those: it makes keyword queries over large amounts of text much easier and faster.

Then again, in most cases we don't want the full texts for keyword queries but for topic modeling and sentiment analysis, and a database won't speed those up.

As far as I can see, the only cases where we want to work with full texts and do not have an external API for keyword queries are:

Another question is whether we want to work with a fixed, pre-defined set of keyword queries (which we could pre-compute slowly every day) or allow flexible, user-defined keyword queries (in which case a database would definitely help). I lean toward the latter.
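For the flexible, user-defined keyword queries over full texts, one possible shape is a full-text index. The sketch below uses SQLite's FTS5 extension through Python's built-in `sqlite3` module, assuming the local SQLite build ships with FTS5. The table name `articles`, the helper `search`, and the sample rows are invented for illustration; nothing here reflects a decision made for this project.

```python
import sqlite3

# In-memory database for illustration; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")

# Assumption: the SQLite build includes the FTS5 extension.
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")

# Made-up sample articles, not real data.
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [
        ("Climate march", "Thousands joined a climate protest in Berlin."),
        ("Transport strike", "Bus drivers went on strike over pay."),
        ("Budget debate", "Parliament debated the climate budget."),
    ],
)


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return titles of articles matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


both_terms = search(conn, "climate AND protest")
any_climate = search(conn, "climate")
```

The query string follows FTS5 syntax (`AND`, `OR`, `NOT`, quoted phrases), so user input would need validation before being passed through. The default tokenizer does not stem, so `protest` will not match `protests`.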