angus-g closed this 3 years ago
Test was on /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091_weddell_up1. Maybe not a good test, as it has changed and has 5 new files since I ran it last. I have re-run with the currently installed version in conda, master, and this PR.
| | conda | #pr241 | master |
|---|---|---|---|
| size | 1.9M | 8.0M | 14.0M |
| time | 00:43 | 01:06 | 03:48 |
Indexing was a lot faster, 01:06 vs 03:48 (not sure if there is any file caching effect in there).
The DB size is a lot smaller without any attributes, just a big table even with only indices, but there is a big reduction with the new PR. The current cosima_master.db is 532M. I guess we'd expect this to increase to 2-3GB. Sounds manageable.
Are your `master` and `pr#241` columns swapped?
> Are your `master` and `pr#241` columns swapped?
No ;)
I think the only way the DB size could be significantly reduced would be to add a small subset of attributes to the `ncvars` table. This would be schema-breaking I guess, and would be less flexible, restricting queries/filtering to only those attributes.
I'm not saying that is the way it should be done, just thought I'd say it out loud.
Could possibly keep the more general layout, but whitelist attributes in `metadata.yaml` or similar?
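As a sketch of what that whitelisting might look like (everything below is hypothetical: the default list, the `select_attributes` helper, and the idea of an `attribute_whitelist` entry in `metadata.yaml` are all made up for illustration, not part of the current codebase):

```python
# Hypothetical sketch of attribute whitelisting; none of these names
# exist in the codebase today.
DEFAULT_WHITELIST = {"units", "long_name", "standard_name"}

def select_attributes(attrs, extra_whitelist=()):
    """Keep only whitelisted attributes, e.g. before inserting into the DB.

    extra_whitelist could come from an (imaginary) attribute_whitelist
    entry in metadata.yaml.
    """
    allowed = DEFAULT_WHITELIST | set(extra_whitelist)
    return {k: v for k, v in attrs.items() if k in allowed}

# Example NetCDF-style attribute dict for a velocity variable
attrs = {
    "units": "m s-1",
    "long_name": "zonal velocity",
    "history": "created by some workflow",
    "cell_methods": "time: mean",
}
kept = select_attributes(attrs, extra_whitelist=["cell_methods"])
print(kept)  # only units, long_name and cell_methods survive
```

The per-experiment extra whitelist is what would keep this flexible: the default covers the common CF attributes, and anything unusual a particular experiment cares about gets added in its own `metadata.yaml`.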
Honestly I didn't want to make more work for you, just wanted to make sure all the bases had been covered.
Just a little disconcerted that the DB increases 4x in size for some attributes, many of which we're never going to care much about, I suspect.
Whitelisting is an interesting idea. Could define a default whitelist and add to it with `metadata.yaml` or similar I guess?
Another way to look at it might be that 4x is still not huge, necessarily. As long as index and querying times aren't much affected, it doesn't really hurt to have as much available as possible.
> Could define a default whitelist and add to it with `metadata.yaml` or similar I guess?
That'd be a pretty reasonable way to do it, yeah.
> Another way to look at it might be that 4x is still not huge, necessarily. As long as index and querying times aren't much affected, it doesn't really hurt to have as much available as possible.
Agreed. At the very least it seems fine now (though I haven't tested query times). So best to go with this and address it later if there is an issue.
Which is a long way of saying ignore everything I just said ... :)
So merge I reckon.
We were adding NetCDF file- and variable-level attributes unconditionally to the `ncattributes` table. Because of the vast amount of duplication in these attribute names and values, our only hope of not blowing up the database size was that SQLite itself would adequately compress things, but this didn't seem to be the case.

We add a new table to intern the (unique) string values. This needs a little bit of SQLAlchemy machinery to let us seamlessly add attributes through the usual mapped collection interface. This involves a mechanism similar to the existing `UniqueMixin`, but which is automatically fired through SQLAlchemy's event system.

Closes #240.
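For anyone curious what the interning buys here, a minimal sketch of the idea using the stdlib `sqlite3` module (the PR itself does this through SQLAlchemy; the table and column names below are made up for illustration, not the PR's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ncattribute_strings (
    id INTEGER PRIMARY KEY,
    value TEXT UNIQUE NOT NULL
);
CREATE TABLE ncattributes (
    id INTEGER PRIMARY KEY,
    name_id INTEGER REFERENCES ncattribute_strings(id),
    value_id INTEGER REFERENCES ncattribute_strings(id)
);
""")

def intern_string(conn, s):
    """Return the id of s in the strings table, inserting it only once."""
    conn.execute(
        "INSERT OR IGNORE INTO ncattribute_strings (value) VALUES (?)", (s,))
    (rowid,) = conn.execute(
        "SELECT id FROM ncattribute_strings WHERE value = ?", (s,)).fetchone()
    return rowid

def add_attribute(conn, name, value):
    """Store an attribute as a pair of references to interned strings."""
    conn.execute(
        "INSERT INTO ncattributes (name_id, value_id) VALUES (?, ?)",
        (intern_string(conn, name), intern_string(conn, value)))

# "units" and "m s-1" are stored once, even though two variables use them
add_attribute(conn, "units", "m s-1")
add_attribute(conn, "units", "m s-1")
(n_strings,) = conn.execute(
    "SELECT COUNT(*) FROM ncattribute_strings").fetchone()
(n_attrs,) = conn.execute(
    "SELECT COUNT(*) FROM ncattributes").fetchone()
print(n_strings, n_attrs)  # 2 2: two attribute rows share two strings
```

With the vast duplication of attribute names and values across files in an experiment, each distinct string is stored once and every further occurrence is just an integer reference, which is where the size reduction in the table above comes from.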