Some inconclusive thoughts about the necessity of a database:
My initial instinct is to avoid a database (it introduces unnecessary complexity); I would prefer some simple joblib-based caching for storing count data from external APIs.
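To make the caching idea concrete, here is a minimal sketch using `joblib.Memory`. The function `fetch_count`, its parameters, and the value it returns are hypothetical placeholders for a real API request, and the temporary cache directory is only there to keep the example self-contained.

```python
import tempfile

from joblib import Memory

# Assumption: a throwaway cache directory; a real setup would use a fixed path.
cache_dir = tempfile.mkdtemp()
memory = Memory(cache_dir, verbose=0)

# Records every call that actually runs (i.e. is not served from the cache).
api_calls = []


@memory.cache
def fetch_count(keyword, date):
    """Hypothetical stand-in for a request to an external count API."""
    api_calls.append((keyword, date))
    return len(keyword)  # placeholder value, not real data


first = fetch_count("climate", "2023-01-01")   # runs the function
second = fetch_count("climate", "2023-01-01")  # loaded from the on-disk cache
```

With this approach the cache key is just the function arguments, which fits count data well but offers no way to search inside stored texts.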
However, since we also want to work with full texts, a database might make a lot of sense for those, because keyword queries over large amounts of text will be much easier and faster with a database.
Then again, in most cases we don't want the full texts for keyword queries but for topic modeling and sentiment analysis, and a database won't speed those up.
As far as I can see, the only cases where we want to work with full texts and do not have an external API for keyword queries are:
self-scraped newspaper articles (if we go for that; there are alternative newspaper APIs with keyword query functionality)
social media analyses (here it is still a bit unclear what sort of data we can get)
parliamentary speech and policy
Another question is whether we want to work with a fixed, pre-defined set of keyword queries (which we could also pre-compute slowly every day) or whether we want to allow flexible, user-defined keyword queries (in which case a database would definitely help). I lean toward the latter.
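For the flexible, user-defined case, a rough sketch of what database-backed keyword queries could look like, using SQLite's FTS5 full-text index via Python's built-in `sqlite3` module. This is only one possible option, not a decision: the table name, column names, and sample texts are made up for illustration, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 virtual table: "body" is full-text indexed, "source" is stored but not searched.
conn.execute("CREATE VIRTUAL TABLE texts USING fts5(source UNINDEXED, body)")
conn.executemany(
    "INSERT INTO texts (source, body) VALUES (?, ?)",
    [
        # Made-up sample rows, not real data.
        ("newspaper", "Parliament debates new climate policy"),
        ("social", "Loving the sunny weather today"),
        ("parliament", "Speech on climate targets and energy prices"),
    ],
)


def count_matches(keyword):
    """Return how many stored texts contain the user-supplied keyword."""
    # Quote the keyword so it is treated as a literal phrase, not as FTS5 query syntax.
    phrase = '"' + keyword.replace('"', '""') + '"'
    row = conn.execute(
        "SELECT COUNT(*) FROM texts WHERE texts MATCH ?", (phrase,)
    ).fetchone()
    return row[0]


climate_hits = count_matches("climate")
```

A fixed set of queries could still be pre-computed daily on top of the same table; the index mainly pays off once arbitrary keywords have to be answered on demand.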