Deduplicate attribute key/value strings

COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.

https://cosima-recipes.readthedocs.io/en/latest/

Apache License 2.0

58 stars 25 forks source link

Deduplicate attribute key/value strings #241

Closed angus-g closed 3 years ago

angus-g commented 3 years ago

We were adding NetCDF file- and variable-level attributes unconditionally to the ncattributes table. Because of the vast amount of duplication in these attribute names and values, our only hope to not blow up the database size was that sqlite itself would adequately compress things, but this didn't seem to be the case.

We add a new table, to intern the (unique) string values. This needs a little bit of sqlalchemy machinery to let us seamlessly add attributes through the usual mapped collection interface. This involves a similar mechanism to the exsting UniqueMixin, but which is automatically fired through sqlalchemy's event system.

Closes #240.

[ ] check DB filesize difference before attributes, and before/after this patch (@aidanheerdegen which experiment was your test mentioned in https://github.com/COSIMA/cosima-cookbook/issues/240#issue-841488293 on?)

aidanheerdegen commented 3 years ago

Test was on /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091_weddell_up1. Maybe not a good test as it has changed and has 5 new files since I ran it last. I have re-run with currently installed version in conda, master and this PR.

	`conda`	`#pr241`	`master`
size	1.9M	8.0M	14.0M
time	00.43	01:06	03:48

Indexing was a lot faster, 01:06 v 03:48 (not sure if there is any file caching effect in there).

DB size is a lot smaller without any attributes, just a big table even with only indices, but a big reduction with the new PR. Current cosima_master.db is 532M. I guess we'd expect this to increase to 2-3GB. Sounds manageable.

angus-g commented 3 years ago

Are your master and pr#241 columns swapped?

aidanheerdegen commented 3 years ago

Are your master and pr#241 columns swapped?

No ;)

aidanheerdegen commented 3 years ago

I think the only way the DB size could be significantly reduced would be to add a small subset of attributes to the ncvars table. This would be schema breaking I guess, and would be less flexible and restrict queries/filtering to only those attributes.

I'm not saying that is the way it should be done, just thought I'd say it out loud.

angus-g commented 3 years ago

Could possibly keep the more general layout, but whitelist attributes in metadata.yaml or similar?

aidanheerdegen commented 3 years ago

Honestly I didn't want to make more work for you, just wanted to make sure all the bases had been covered.

Just a little disconcerted the DB increases 4x in size for some attributes many of which we're never going to care much about I suspect.

Whitelisting is an interesting idea. Could define default whitelist and add to it with metadata.yaml or similar I guess?

angus-g commented 3 years ago

Another way to look at it might be that 4x is still not huge, necessarily. As long as index and querying times aren't much affected, it doesn't really hurt to have as much available as possible.

Could define default whitelist and add to it with metadata.yaml or similar I guess?

That'd be a pretty reasonable way to do it, yeah.

aidanheerdegen commented 3 years ago

Another way to look at it might be that 4x is still not huge, necessarily. As long as index and querying times aren't much affected, it doesn't really hurt to have as much available as possible.

Agreed. At the very least it seems fine now (thought I haven't tested query times). So best to go with this and address later if there is an issue.

Which is a long way of saying ignore everything I just said ... :)

aidanheerdegen commented 3 years ago

So merge I reckon.