cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0

consider simple `join` functionality #111

Closed SimonGreenhill closed 4 years ago

SimonGreenhill commented 4 years ago

It would be really helpful to have a simple join ability. Right now, to load the cognates for a Form, I'm doing something like this:

lexemes = {row['ID']: row for row in cldf['FormTable']}
for row in cldf['CognateTable']:
    # cognates point at forms via Form_ID, so that is the key to join on
    lexemes[row['Form_ID']].update({ ...some values ... })

whereas something like this would be much nicer:

for row in cldf[('LanguageTable', 'CognateTable')]:
    print(row)

This of course makes data loading more complicated, and adding join support risks reinventing a database engine (and raises the question of which types of joins should be supported). But we do know the relationships from the metadata.json, and I'm worried I'll join things incorrectly by hand.
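
For concreteness, a minimal sketch of the manual join done via the standard relations, assuming the default column names of the Wordlist components (ID, Form_ID, Language_ID) and a hypothetical metadata path; the real names should of course be read from the metadata:

from pycldf import Dataset

cldf = Dataset.from_metadata('cldf-metadata.json')  # hypothetical path

# group cognate judgements by the form they refer to (CognateTable -> formReference)
cognates_by_form = {}
for row in cldf['CognateTable']:
    cognates_by_form.setdefault(row['Form_ID'], []).append(row)

# walk the forms, pulling in the language and the cognates for each one
languages = {row['ID']: row for row in cldf['LanguageTable']}
for form in cldf['FormTable']:
    language = languages.get(form['Language_ID'])
    cognates = cognates_by_form.get(form['ID'], [])
    # ... use form, language and cognates here ...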

xrotwang commented 4 years ago

I think that's too simplistic. There may be more than one relation between LanguageTable and CognateTable. One could say that this is a shortcut for the "standard" relation via FormTable->languageReference and CognateTable->formReference - but even that breaks down in the case of borrowings.

I think the only syntax I could sort of get behind for this kind of functionality would be SQL - the same SQL that could also be used against the SQLite db created from a dataset. And I suspect that in many cases creating the SQLite db on the fly and then running the query will not be much slower. Let me experiment a bit.
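
Roughly the kind of query I have in mind (a sketch only - it assumes the cldf_-prefixed column names the SQLite conversion uses for CLDF properties and a placeholder db path; check the generated schema for the actual names):

import sqlite3

conn = sqlite3.connect('dataset.sqlite')  # hypothetical path to the converted dataset
query = """
    SELECT l.cldf_name, f.cldf_form, c.cldf_cognatesetReference
    FROM CognateTable AS c
    JOIN FormTable AS f ON c.cldf_formReference = f.cldf_id
    JOIN LanguageTable AS l ON f.cldf_languageReference = l.cldf_id
"""
for name, form, cogset in conn.execute(query):
    print(name, form, cogset)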

xrotwang commented 4 years ago

This would be a bit like how csvsql works.

SimonGreenhill commented 4 years ago

Yeah, I can see that it's a can of worms. I meant to experiment and compare how fast it is to load the CLDF by iterating over the CSV versus loading it into an in-memory SQLite database, but I haven't had a chance, so I'm keen to see what you come up with.

xrotwang commented 4 years ago

It can't be any faster, I guess, since loading currently implies iterating over all rows in all tables anyway. But if the db is in-memory, it might be only slightly slower.

xrotwang commented 4 years ago

Hm. First small test: creating the database for lexibank/abvd in memory and on disk both take about 35 secs. So writing to disk doesn't seem to add much (for a 75MB database).
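
For anyone who wants to try this, a rough sketch of how such a timing could be run (paths are placeholders, and it assumes the Database class from pycldf.db):

import time
from pycldf import Dataset
from pycldf.db import Database

# hypothetical path to the lexibank/abvd CLDF metadata
dataset = Dataset.from_metadata('abvd/cldf/cldf-metadata.json')

start = time.time()
db = Database(dataset, fname='abvd.sqlite')  # target file for the converted dataset
db.write_from_tg()  # write all tables of the dataset to the db
print('created in {:.1f} secs'.format(time.time() - start))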

SimonGreenhill commented 4 years ago

wow, and this should be one of the big datasets where any differences are noticeable. SQLite is amazing.

So it makes no sense to implement this in Python; if you want to do this, convert to SQLite instead. wontfix!

SimonGreenhill commented 4 years ago

Oh -- do we have some documentation on how best to do this?

xrotwang commented 4 years ago

Hm. Not really. This is about it https://github.com/cldf/pycldf/blob/dcd5acd68a3ace91e9d8cb121b2eaf7132c1f78c/src/pycldf/db.py#L2-L36 Will add something to the pycldf README.
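
For the record, I think the short version is a one-liner with the `cldf` command line tool that ships with pycldf, something like `cldf createdb path/to/cldf-metadata.json dataset.sqlite` (sketched from memory, so double-check against the README once it's updated).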

xrotwang commented 4 years ago

See https://github.com/cldf/pycldf/issues/112

SimonGreenhill commented 4 years ago

thanks!