Cache built kibblesets for later reuse

Viqsi commented 10 months ago

The most frequent slowdown we run into in regular operations is when a kibble set has to be built for a given variable. We're already running into situations where we have to "cache" these results (witness the proliferation of All-EXISTS cohorts for FHVs in particular), so it might be worthwhile to just go the next step and formalize this caching process.

This could potentially speed up creation of cohorts and custom variables and downloads (for example, previewing a potential cohort would populate the cache and thus make the ensuing creation - and subsequent previews - much faster), simplfy the process of envaluing a custom variable, and keep cohortInds from becoming an untenable slow monster. It also cleanly separates an existing de facto system caching function from vital user data. We'd just have to remember to purge the kibble cache any time variable data changes (not often), make sure the resulting cache table(s) is/are supremely well indexed, and we're golden.

Something to give some serious thought IMO.

Viqsi commented 4 months ago

I had a different concept of how this might work come to mind. As described to @WValenti in Skype (with some edits):

The idea amounts to taking that concept I already have with "SVVFHVs" (the persistent table that is actually a cache) and doing something similar with 'kibble' - creating a new table that basically pares a copy of 'kibble' down to the few bits DIVER actually cares about - new_ind_id, variable_name (as determined by equivalence_groups.first_member), value.

The reasons why that are coming to mind:

The kibble table is currently ~5.7GB on disk, a little over two thirds of which is indexes.

A "down to DIVER's essentials" copy would likely be smaller, because of the elimination of both 1) "invalid" rows (read: rows for ind_ids and variable_ids in interactive tables) and, more importantly, 2) indexes for extra columns that this pared-down copy wouldn't need.

It'd have to be "rebuilt" each time we add data, but we're already having to rebuild and recheck things with additions and revisions to distributions. So make it a formal process and designate what specifically we use it for for DIVER, and leave the DIGS stuff (which is otherwise tracking necessary, vital, and yet convoluted information) out of it.

And that way any kibble operations - cohort creation, custom variable creation, download creation - is ostensibly much more simple, because the number of lookups needed for each row is drastically reduced and we don't have to do it with temporary generated stuff on the fly.

Heck, it'd eliminate the need for AECs for existing (non-custom) variables entirely.

So yeah. a thought coming to mind. I think it achieves the objective of Don't Have A Giant Sparse Table while still making DIVER work and while not impacting the DIGS backend itself.

(It'd probably also make deployment less of a headache for ISI.)

WValenti commented 4 months ago

As discussed in person, kibble is subject to extensive development for the addition of longitudinal information. This is going to be a long-term project as it requires significant noodling first, and will involve UI changes as well.

WValenti commented 4 months ago

Also, I believe in a cacheless society. Gimme credit for my experience - cache causes problems. ;-)

Viqsi commented 4 months ago

I mean, I would prefer not to have "cache tables" myself, but until MariaDB/MySQL give us native support for materialized views, the performance penalties are way too severe. :(

MathematicalMedicine / diver-issues

Cache built kibblesets for later reuse #229