blekhmanlab / compendium_website

Website for the Human Microbiome Compendium
http://microbiomap.org/

How to split up data processing scripts/queries? #7

Closed vincerubinetti closed 1 year ago

vincerubinetti commented 1 year ago

From Slack:

We have a lot of data that needs to be sliced/faceted in various ways, and for various contexts (the paper, the website, the R package/data download). More importantly, there is overlap in the slices needed across these contexts (e.g. there is some more complex processing/analysis to get the taxonomic prevalence data, which is needed for both the website and the paper). We should automate these with scripts somewhere, but where is the question. The website will probably need rapidly changing data structures to match implementation details, so that's an argument to put that pre-processing in the website repo (with Node.js), but then we might end up duplicating processing scripts between the dataset/paper and the website.

On Zoom we had also talked about possibly just storing a duplicate of the new single database file in the website repo, and having the Node.js script compile the info by making SQL queries directly. This is not completely off the table, but I'm now leaning toward having all of the data processing for everything be colocated so there is no duplication.

Rich, can you look into adding extra scripts alongside your existing ones that give me the data I need, which is described by the TypeScript schemas in src/data/index.ts? I will try to keep my schemas minimal and stable so I don't have to request new/updated scripts from you all the time. Hopefully you can do this very "close" to the database, with just nice SQL queries and minimal script processing. Being able to export from the database as JSON would be nice too, but it's also not a big deal to have one small extra step somewhere in the pipeline (database post-process script, website pre-compile script, in-browser conversion) to convert CSV to JSON.
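As a rough illustration (not code from this repo), that extra CSV-to-JSON step could be a small Node.js/TypeScript script run at website pre-compile time. The file names, column handling, and naive line-splitting parser below are assumptions for the sketch; it ignores quoted/escaped CSV fields.

```ts
// csv-to-json.ts — hypothetical pre-compile step: convert a simple CSV export to JSON.
// Assumes a headered CSV with no quoted/escaped fields; file names are made up.
import { readFileSync, writeFileSync } from "fs";

type Row = Record<string, string>;

const csv = readFileSync("data/taxa-prevalence.csv", "utf8").trim();
const [header, ...lines] = csv.split("\n");
const columns = header.split(",");

const rows: Row[] = lines.map((line) => {
  const values = line.split(",");
  // Pair each header column with the value in the same position.
  return Object.fromEntries(columns.map((col, i) => [col, values[i] ?? ""]));
});

writeFileSync("public/taxa-prevalence.json", JSON.stringify(rows, null, 2));
```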

vincerubinetti commented 1 year ago

And as a good practice, as new features get suggested, we should try to define a schema for what a feature will include before actually writing a processing script and implementing it. For example in #9, we should try to decide what fields will be in the final JSON before doing anything else, so you don't have to go back and update the scripts constantly (see the sketch after the quote below).

This is to try to avoid this problem:

website will probably need rapidly changing data structures to match implementation details
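Concretely, "define a schema first" could just mean committing a TypeScript type before any processing script exists. The fields below are hypothetical placeholders, not the actual schema for #9:

```ts
// Hypothetical example of a schema agreed on before writing any processing script.
// Field names and shapes are placeholders, not the actual schema for issue #9.
type TaxonPrevalence = {
  taxon: string;      // e.g. genus or phylum name
  samples: number;    // number of samples the taxon appears in
  prevalence: number; // fraction of samples, 0–1
};

// The processing script's only job is then to emit data matching this shape.
export type TaxaPrevalenceJson = TaxonPrevalence[];
```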

vincerubinetti commented 1 year ago

Eventually, then, I guess the idea is that I would get rid of as much of my pre-processing in /data as possible, and have the database-side scripts provide as much of /public/*.json as possible, except for the Natural Earth parts of the data.

vincerubinetti commented 1 year ago

I think we've landed on this: the paper folks handle the scripts they need for the paper, and I handle the scripts needed for the website, deriving the data directly from the data of record on Zenodo. There may be some duplication of processing there, but I think it's worth it to reduce the workload for the paper folks, and it also gives me greater/quicker control over the data format I need.

As a future enhancement, we can try to remove some of my script code and replace it with direct SQL queries.
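A very rough sketch of what that could look like, assuming the compendium database is a SQLite file and using the better-sqlite3 package; the table and column names are invented for illustration:

```ts
// export-json.ts — hypothetical: let SQLite build the JSON and just write it out.
// Assumes a SQLite database file and uses invented table/column names.
import Database from "better-sqlite3";
import { writeFileSync } from "fs";

const db = new Database("compendium.db", { readonly: true });

// SQLite's JSON1 functions assemble the array of objects directly in SQL,
// so the script does no row-by-row processing of its own.
const row = db
  .prepare(
    `SELECT json_group_array(json_object('taxon', taxon, 'prevalence', prevalence)) AS json
     FROM taxa_prevalence`
  )
  .get() as { json: string };

writeFileSync("public/taxa-prevalence.json", row.json);
```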