cldf / pycldf

Python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

first stab at orm #123

Closed: xrotwang closed this pull request 3 years ago

xrotwang commented 3 years ago

For usage examples see https://github.com/cldf/pycldf/issues/121#issuecomment-758635030 and the tests.

closes #121
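
A rough sketch of the kind of usage this enables, following the examples linked above. It assumes the `objects` accessor and the `.cldf` attribute namespace of the eventual pycldf ORM, so details may differ in this first stab:

```python
# Hedged sketch of ORM usage, not a verbatim example from the PR.
from pycldf import Dataset

ds = Dataset.from_metadata('path/to/cldf-metadata.json')  # hypothetical path

for form in ds.objects('FormTable'):
    # Relations are resolved to full objects rather than raw ID strings:
    print(form.language.cldf.name, form.parameter.cldf.name, form.cldf.form)
```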

xrotwang commented 3 years ago

I still have to flesh things out and write more tests, so don't merge this yet.

codecov-io commented 3 years ago

Codecov Report

Merging #123 (763998c) into master (776a5e2) will not change coverage. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##            master      #123    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           22        24     +2     
  Lines         1750      2029   +279     
==========================================
+ Hits          1750      2029   +279     
Impacted Files            Coverage Δ
src/pycldf/dataset.py     100.00% <100.00%> (ø)
src/pycldf/orm.py         100.00% <100.00%> (ø)
src/pycldf/sources.py     100.00% <100.00%> (ø)
src/pycldf/terms.py       100.00% <100.00%> (ø)
src/pycldf/util.py        100.00% <100.00%> (ø)
tests/conftest.py         100.00% <100.00%> (ø)
tests/test_cli.py         100.00% <100.00%> (ø)
tests/test_dataset.py     100.00% <100.00%> (ø)
tests/test_orm.py         100.00% <100.00%> (ø)
tests/test_util.py        100.00% <100.00%> (ø)
... and 2 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 776a5e2...763998c.

SimonGreenhill commented 3 years ago

One thing to consider -- @XachaB had to resort to a FlyWeight class in the Sound Correspondences project (https://github.com/lexibank/lexitools/pull/14/commits/7ba360ff5590456b2dee8daabcd78b02e7d12304) because his objects were getting very memory-hungry. Would something like that be useful here?

xrotwang commented 3 years ago

As far as I can tell, the FlyWeight class only ensures that objects with the same data are created once. My code caches objects after instantiation at the dataset level, which should have the same effect. I wouldn't want to go further than that in terms of optimizing memory usage, though, because I think that's the basic trade-off of an ORM: more convenience, less memory efficiency. And the ORM methods can still be used to make loading multiple datasets into a SQLite db more convenient.
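
A minimal sketch of that caching strategy (all names here are hypothetical, not pycldf's actual API; the real implementation lives in `src/pycldf/orm.py`):

```python
# Hypothetical sketch: objects are created once per (table, row ID) and
# reused afterwards, which deduplicates instances much like a flyweight.
class CachingDataset:
    def __init__(self):
        self._object_cache = {}  # (table, row ID) -> ORM object

    def get_object(self, table, id_, factory):
        key = (table, id_)
        if key not in self._object_cache:
            self._object_cache[key] = factory()  # instantiate only once
        return self._object_cache[key]
```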

SimonGreenhill commented 3 years ago

Cool - just wondering if we could nip a potential problem in the bud. Agree that premature optimisation is bad.

xrotwang commented 3 years ago

So here are some numbers. I compared the memory footprint of some datasets, loading all CLDF components:

dataset   csv      orm
apics     114 MB   152 MB
wals      144 MB   221 MB
ids       1 GB     1.4 GB

So it looks like the ORM adds 30-40% to the memory footprint. I think that's acceptable, i.e. I would not optimize by caching only the raw dicts and instantiating objects on access.

It's clear, though, that any "hold all data in memory" approach is limited (extrapolating from IDS: to about 1 million forms per 2 GB of memory for Wordlist data). Maybe we should publish some benchmarks along these lines somewhere.
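
One possible way to produce such benchmarks, as a sketch: this assumes psutil is installed, a local CLDF dataset at a hypothetical path, and an illustrative component list.

```python
import os

import psutil  # assumed third-party dependency for measuring RSS
from pycldf import Dataset

def rss_mb():
    """Resident set size of this process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 2 ** 20

before = rss_mb()
ds = Dataset.from_metadata('path/to/cldf-metadata.json')  # hypothetical path
# Load all objects of a few components (list only ones the dataset has):
loaded = {c: list(ds.objects(c)) for c in ['LanguageTable', 'ParameterTable', 'FormTable']}
print('ORM footprint: {:.0f} MB'.format(rss_mb() - before))
```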

SimonGreenhill commented 3 years ago

Thanks for the benchmark. I guess for datasets the size of IDS we shouldn't be reading everything into memory at once anyway.

For the ORM, that doesn't look like a big overhead for the added functionality (and if it does become a problem, I suspect you'd get more savings by replacing OrderedDict with dict, especially since dicts are now ordered, or even a list/tuple).
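
A small illustration of the dict point, using only the standard library: since Python 3.7, plain dicts preserve insertion order, while collections.OrderedDict carries extra bookkeeping.

```python
import sys
from collections import OrderedDict

d = {k: str(k) for k in range(100)}
od = OrderedDict((k, str(k)) for k in range(100))

print(list(d) == list(od))                  # True: identical iteration order
print(sys.getsizeof(d), sys.getsizeof(od))  # OrderedDict typically reports a larger size
```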

xrotwang commented 3 years ago

The thing with the ORM functionality is that you basically have to read everything in to be able to resolve relations with somewhat predictable performance. But I think that's ok, because the limitation is transparent: if it all fits in memory, you can use the ORM; otherwise, use something else. I definitely wouldn't want to implement something like SQLAlchemy's Session here (and if I did, I'd just use SQLAlchemy :) )
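
A hypothetical illustration of the point about predictable performance: with all rows read in and indexed by ID, resolving a relation is an O(1) dict lookup instead of a rescan of the CSV files (the data here is made up).

```python
def index_by_id(rows):
    """Index rows by their ID column for constant-time foreign-key resolution."""
    return {row['ID']: row for row in rows}

languages = index_by_id([{'ID': 'l1', 'Name': 'Language One'}])
forms = [{'ID': 'f1', 'Language_ID': 'l1', 'Form': 'foo'}]

for form in forms:
    language = languages[form['Language_ID']]  # in-memory, predictable lookup
    print(form['Form'], '->', language['Name'])
```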