Devographics / Monorepo

Monorepo containing the State of JS apps
surveyform-sigma.vercel.app

External processing based on the CSV export #281

Open eric-burel opened 11 months ago

eric-burel commented 11 months ago

As we have more and more voluminous and insightful data, some data analysis usages are coming up and could be harder to solve in JavaScript than in other languages, namely Python.

For example:

Nowadays, all those use cases can be solved in a few lines of Python with the right library, based on a CSV extract of the data; it's usually harder to find JavaScript counterparts, or to get the data in the right shape with Mongo aggregations.
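As a sketch of the kind of analysis meant here, the snippet below counts tool mentions across responses from a CSV extract. It uses only the standard library for illustration; the column names (`_id`, `tools_used`, etc.) are hypothetical, not the actual mongoexport output.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV extract: column names are illustrative, not the
# real surveyadmin/mongoexport schema.
raw = io.StringIO(
    "_id,years_of_experience,tools_used\n"
    "r1,3,react;vue\n"
    "r2,7,react\n"
    "r3,5,svelte;react\n"
)

rows = list(csv.DictReader(raw))

# Count how often each tool appears across all responses.
tool_counts = Counter(
    tool
    for row in rows
    for tool in row["tools_used"].split(";")
)

print(tool_counts.most_common(1))  # → [('react', 3)]
```

With a library like pandas the same aggregation is a one-liner, which is the point of the proposal: this kind of reshaping is far more ergonomic in Python than via Mongo aggregations.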

Here is a proposal to allow such analysis:

1) Facilitate generating a CSV export. I've improved the surveyadmin interface, and we are able to generate a "mongoexport" command with the right column names based on the survey outline. This works surprisingly well.
2) Set up a minimal infrastructure to run some Python logic over this CSV export and output a new CSV. The trickiest part is setting up Python reliably, which is really a nightmare. I think VS Code Docker containers can add consistency and remove the need to actually install Python in the right version, etc. If a script ends up being particularly useful, we can generate a binary out of the Python code too.
3) Have logic to merge the new data back into the Mongo database. The format could be CSV, with a column recalling the response "_id" in the database and more columns with the additional data we want to store for each response. This could be done from surveyadmin. If we stick to adding new columns, we don't risk altering the normalized data too much. Or we can use a separate collection, whatever we want. Using the normalized response "_id" might not always be reliable, as there are some issues around string ids (namely, when updating a normalized response we can't guarantee that the "_id" is stable, due to dumb limitations in Mongo's replace operation), so we might optionally allow using another field like the linked responseId.
4) Finally, those new fields could be used in the results app or to feed separate apps.
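A minimal sketch of step 3, assuming a processed CSV whose first column recalls the response id: each row becomes an upsert into a separate collection, so the normalized data is never touched. All field, column, and collection names here are illustrative, and the actual pymongo call is shown only as a comment.

```python
import csv
import io

# Hypothetical processed CSV coming back from the Python step: the first
# column recalls the response id, the rest are new computed fields.
processed = io.StringIO(
    "_id,sentiment_score,topic_cluster\n"
    "r1,0.82,frameworks\n"
    "r2,-0.10,tooling\n"
)

# Build one upsert per row, targeting a separate collection so the
# normalized responses are never altered.
operations = []
for row in csv.DictReader(processed):
    response_id = row.pop("_id")
    operations.append(
        {
            "filter": {"responseId": response_id},
            "update": {"$set": row},
            "upsert": True,
        }
    )

# With pymongo these would become bulk writes, e.g.:
#   from pymongo import MongoClient, UpdateOne
#   db.external_computations.bulk_write(
#       [UpdateOne(op["filter"], op["update"], upsert=True) for op in operations]
#   )
print(len(operations))  # → 2
```

Keying on `responseId` rather than the normalized response `_id` sidesteps the id-stability issue mentioned above.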

This obviously adds some complexity to our architecture, given that the JavaScript part still needs a lot of work, but perhaps less complexity than trying to achieve the same use cases in pure JS. And the possibilities it opens up can be quite cool.

SachaG commented 11 months ago

Have logic to merge back the new data into the Mongo database. The format could be CSV, with a column recalling the response "_id" in the database and more columns with the additional data we want to store for each response. This could be done from surveyadmin. If we stick to adding new columns, we don't risk too much to alter the normalized data. Or we can use a separate collection, whatever we want.

If we are merging back the results of calculations, we would probably not store them in the same dataset as our raw data. I imagine we'd have a separate collection that would act as a cache for them? In fact we could use our existing Redis cache.
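If the existing Redis cache were used, the computed results could live under their own key namespace so they never collide with raw data. A minimal sketch, where the key scheme and payload shape are assumptions (not the existing Redis key format) and the real client call is shown only as a comment:

```python
import json

def computation_cache_key(survey_id: str, edition_id: str, computation: str) -> str:
    # Namespaced key so computed results stay separate from raw survey
    # data; this naming scheme is illustrative only.
    return f"computed:{survey_id}:{edition_id}:{computation}"

# Hypothetical computed result, keyed by response id.
payload = json.dumps({"r1": {"sentiment_score": 0.82}})
key = computation_cache_key("state_of_js", "js2023", "sentiment")

# With a real client this would be:
#   import redis
#   redis.Redis().set(key, payload)
print(key)  # → computed:state_of_js:js2023:sentiment
```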

eric-burel commented 11 months ago

I was thinking about one-off computations only at the moment, so they would be computed once and for all rather than being cached. Then yes, it would make sense to store them separately, to be clear about what we can compute easily in JS and what would be a "bonus" (at least for now).