Closed: xrotwang closed this 3 years ago
I still have to flesh things out and write more tests. So don't merge this.
Merging #123 (763998c) into master (776a5e2) will not change coverage. The diff coverage is 100.00%.
Coverage Diff | master | #123 | +/- |
---|---|---|---|
Coverage | 100.00% | 100.00% | |
Files | 22 | 24 | +2 |
Lines | 1750 | 2029 | +279 |
Hits | 1750 | 2029 | +279 |
Impacted Files | Coverage Δ |
---|---|
src/pycldf/dataset.py | 100.00% <100.00%> (ø) |
src/pycldf/orm.py | 100.00% <100.00%> (ø) |
src/pycldf/sources.py | 100.00% <100.00%> (ø) |
src/pycldf/terms.py | 100.00% <100.00%> (ø) |
src/pycldf/util.py | 100.00% <100.00%> (ø) |
tests/conftest.py | 100.00% <100.00%> (ø) |
tests/test_cli.py | 100.00% <100.00%> (ø) |
tests/test_dataset.py | 100.00% <100.00%> (ø) |
tests/test_orm.py | 100.00% <100.00%> (ø) |
tests/test_util.py | 100.00% <100.00%> (ø) |
... and 2 more |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
One thing to consider -- @XachaB had to resort to a FlyWeight class in the Sound Correspondences project (https://github.com/lexibank/lexitools/pull/14/commits/7ba360ff5590456b2dee8daabcd78b02e7d12304), as his objects were getting very memory hungry. Is that something that would be useful here?
As far as I can tell, the FlyWeight class only makes sure that objects with the same data are created only once. My code caches objects after instantiation at the dataset level, which should have the same effect. I wouldn't want to go further than that in terms of optimizing memory usage, though, because I think that's the basic trade-off of an ORM: more convenience, less memory efficiency. And ORM methods can still be used to make loading multiple datasets into a SQLite db more convenient.
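For illustration, dataset-level caching of this kind can be sketched roughly as follows (hypothetical names, not the actual pycldf.orm code). The effect is the same as a flyweight: per dataset, each row is turned into at most one object.

```python
# Rough sketch of dataset-level object caching (hypothetical, not the
# actual pycldf.orm implementation): per dataset, each (component, row ID)
# pair maps to at most one ORM object, which gives the same
# "create each object only once" effect as a flyweight.
class Object:
    def __init__(self, dataset, row):
        self.dataset = dataset
        self.data = row
        self.id = row['ID']


class Dataset:
    def __init__(self):
        self._obj_cache = {}  # keyed by (component name, object ID)

    def get_object(self, component, row):
        key = (component, row['ID'])
        if key not in self._obj_cache:
            self._obj_cache[key] = Object(self, row)
        return self._obj_cache[key]
```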
Cool - just wondering if we could nip a potential problem in the bud. Agree that premature optimisation is bad.
So here are some numbers. I compared the memory footprint of some datasets, loading all CLDF components either as OrderedDicts (as returned from csvw.dsv.UnicodeReader) or as pycldf.orm.Object instances:

 | csv | orm |
---|---|---|
apics | 114 MB | 152 MB |
wals | 144 MB | 221 MB |
ids | 1 GB | 1.4 GB |
So it looks like orm adds 30-40% to the memory footprint. I think that's acceptable, i.e. I would not optimize by caching just the raw dicts and instantiating objects only on access.
It's clear, though, that any "holding all data in memory" approach is limited (e.g. - extrapolating from IDS - to about 1 million forms per 2 GB of memory for Wordlist data). Maybe we should publish some benchmarks along these lines somewhere.
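A comparison along these lines could be scripted roughly as follows; tracemalloc, the iterdicts() call and the objects() entry point are assumptions for the sketch, not necessarily how the numbers above were produced.

```python
# Sketch of a memory comparison for one component. Assumptions: tracemalloc
# for measurement, Table.iterdicts() for the raw csvw rows, and
# Dataset.objects() as the ORM entry point.
import tracemalloc

from pycldf import Dataset


def peak_mb(load):
    """Peak memory (MB) allocated while running ``load``."""
    tracemalloc.start()
    data = load()  # keep a reference so the loaded objects stay alive
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024 ** 2


ds = Dataset.from_metadata('path/to/cldf-metadata.json')
print('csv: {:.0f} MB'.format(peak_mb(lambda: list(ds['FormTable'].iterdicts()))))
print('orm: {:.0f} MB'.format(peak_mb(lambda: list(ds.objects('FormTable')))))
```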
Thanks for the benchmark. I guess for datasets of the size of IDS we shouldn't be reading it all into memory at once anyway.
For the orm, that doesn't look like a big overhead for the added functionality either (and if it does become a problem, I suspect you'd get more savings by replacing the OrderedDict with dict (especially since dicts are now ordered) or even list/tuple).
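For a rough sense of the per-row container overhead, something like the following could be used (illustrative only, with made-up column values; sys.getsizeof counts only the container, not the cell values):

```python
# Illustrative only: compare the size of one row held as OrderedDict,
# plain dict, and tuple. sys.getsizeof reports just the container,
# not the (shared) column names and cell values.
import sys
from collections import OrderedDict

cols = ['ID', 'Language_ID', 'Parameter_ID', 'Form', 'Source']
vals = ['1', 'stan1295', 'param-1', 'hand', 'Meier2005']

for label, row in [
    ('OrderedDict', OrderedDict(zip(cols, vals))),
    ('dict', dict(zip(cols, vals))),
    ('tuple', tuple(vals)),
]:
    print(label, sys.getsizeof(row), 'bytes')
```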
The thing with the ORM functionality is that you basically have to read it all in to be able to resolve relations with somewhat predictable performance. But I think that's ok, because the limitations are transparent: if it all fits in memory, you can use the ORM, otherwise use something else. I definitely wouldn't want to implement something like SQLAlchemy's Session here (and if I had to, I'd actually use SQLAlchemy :) )
For usage examples see https://github.com/cldf/pycldf/issues/121#issuecomment-758635030 and the tests.
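Roughly along the lines of the following sketch (attribute and method names are assumptions based on that discussion; the linked comment and the tests are authoritative):

```python
# Rough usage sketch based on the discussion in #121; the objects() call
# and the .cldf / .language / .parameter attributes are assumptions here,
# see the linked comment and tests/test_orm.py for the authoritative examples.
from pycldf import Dataset

wl = Dataset.from_metadata('path/to/Wordlist-metadata.json')

for form in wl.objects('FormTable'):
    # Related objects are resolved via the foreign keys declared in the metadata:
    print(form.cldf.form, form.language.cldf.name, form.parameter.cldf.name)
```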
closes #121