cldf / csvw

CSV on the web
Apache License 2.0

missing db.commit() in src/csvw/db.py #43

Closed PaprikaSteiger closed 3 years ago

PaprikaSteiger commented 3 years ago

Hello xrotwang

@Anaphory and I have been working on the lexedata project for round-tripping between CLDF and Excel formats. We relied on the pycldf and csvw packages for the CLDF output and the SQLite support. I think there might be a small bug in the src/csvw/db.py module: def insert(db, translate, table, keys, *rows, **kw): does not contain a db.commit() statement. In our case, the data was not inserted into the database until we added a db.commit() after executing csvw.db.insert().
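
For reference, here is a minimal plain-sqlite3 sketch of the behaviour we ran into (no csvw involved, file names made up): rows written through a connection are lost unless commit() is called before the connection is closed.

```python
import pathlib
import sqlite3
import tempfile

path = str(pathlib.Path(tempfile.mkdtemp()) / "example.sqlite")

db = sqlite3.connect(path)
db.execute("CREATE TABLE forms (id TEXT PRIMARY KEY, form TEXT)")
db.execute("INSERT INTO forms VALUES (?, ?)", ("f1", "foo"))
db.close()  # no commit - the pending INSERT is rolled back

db = sqlite3.connect(path)
print(db.execute("SELECT COUNT(*) FROM forms").fetchone())  # (0,)
db.execute("INSERT INTO forms VALUES (?, ?)", ("f1", "foo"))
db.commit()  # the call we had to add after csvw.db.insert()
db.close()

db = sqlite3.connect(path)
print(db.execute("SELECT COUNT(*) FROM forms").fetchone())  # (1,)
db.close()
```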

Kind Regards

xrotwang commented 3 years ago

Hm. Are you calling pycldf.db.insert in your own code? Because where it's called in Database.write, there is a commit statement.

PaprikaSteiger commented 3 years ago

Yes, we are calling pycldf.db.insert but not Database.write. Maybe that's not how it was meant to be used.

xrotwang commented 3 years ago

No, it was not meant that way. Feeding the database directly, without using a Dataset object, is possible - I guess - but bypasses some of the validation. Also, what's specified by CLDF is the CSV-based file format, whereas the SQLite database is mostly an alternative backend for better query performance. So if you are interested in creating shareable data, you should write out the data to "proper" CLDF datasets at some point. If you'd do that as an intermediate step from Excel to SQLite, you could even use Database.write_from_tg as intended :)
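
Roughly like this (untested sketch; directory, file name, IDs and values are just for illustration):

```python
from pycldf import Wordlist
from pycldf.db import Database

# Write a "proper" CLDF dataset first ...
ds = Wordlist.in_dir("cldf")
ds.add_component("LanguageTable")
ds.write(
    FormTable=[
        {"ID": "f1", "Language_ID": "l1", "Parameter_ID": "p1",
         "Value": "foo", "Form": "foo"},
    ],
    LanguageTable=[{"ID": "l1", "Name": "Language 1"}],
)

# ... then load it into SQLite for querying.
db = Database(ds, fname="dataset.sqlite")
db.write_from_tg()
```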

xrotwang commented 3 years ago

Btw. lexedata sounds interesting! Is there a way to have a look at the code? E.g. it would be interesting to see how one would enforce integrity constraints in Excel.

Anaphory commented 3 years ago

It's among my public repositories. We are not enforcing integrity constraints in Excel. The plan is to provide a specific export that makes it easy to edit some aspect of the data in a way where constraints are maintained, makes it hard to impossible to edit in a constraint-violating way, and makes sure that people are aware that it's a partial view of some aspect of the data and that the authoritative version is the CLDF. This means that our strategy to actually enforce constraints is to whine* upon re-import.

*) as in, guess fixes that restore integrity where possible, and warn about both issues that were automatically fixed and issues that could not be fixed.

xrotwang commented 3 years ago

Ok, so in a data publishing scenario you could just

  • use lexedata from within cldfbench
  • throw the excel file in raw/ and get a full lexibank dataset, correct?

Anaphory commented 3 years ago

Ok, so in a data publishing scenario you could just

  • use lexedata from within cldfbench
  • throw the excel file in raw/ and get a full lexibank dataset, correct?

In theory, yes; in practice, lexedata.importer.fromexcel is supposed to be a pre-processing step for a lexical dataset that is not at all CLDF-compatible yet, where the editors of the dataset want to continue editing the dataset, but with an authoritative version in versioned CLDF instead of whatever idiosyncratic format they had before. So I don't think lexibank entry points in cldfbench are the right context at all.

In terms of language & concept metadata, it imports whatever is given by the source – there could be multiple header rows for each language, containing idiosyncratically formatted language metadata, which the importer can extract, but which should not be trusted without further human intervention.

xrotwang commented 3 years ago

When you say

lexedata.importer.fromexcel is supposed to be a pre-processing step for a lexical dataset that is not at all CLDF-compatible yet, where the editors of the dataset want to continue editing the dataset, but with an authoritative version in versioned CLDF

that sounds exactly like what cldfbench/pylexibank is supposed to make easier: the data in the form maintained and edited by the user goes into raw/, etc/ adds metadata, and the bits of Python code in lexibank_*.py churn out the CLDF data to cldf/. And lexedata would add functionality to be called in lexibank_*.py for the case that raw/ holds Excel files of the type that lexedata knows how to handle.
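
I.e. a lexibank_*.py along these lines (rough, untested skeleton; the dataset id and the Excel-parsing hook are placeholders):

```python
import pathlib

from pylexibank import Dataset as BaseDataset


class Dataset(BaseDataset):
    """Rough skeleton of a lexibank_*.py entry point."""
    dir = pathlib.Path(__file__).parent
    id = "mydataset"  # placeholder dataset id

    def cmd_makecldf(self, args):
        # Hypothetical: parse the Excel file in raw/ with lexedata (whatever
        # API it exposes for that), then hand the rows to the CLDF writer.
        for language, concept, value in self.parse_raw_excel():
            args.writer.add_forms_from_value(
                Language_ID=language,
                Parameter_ID=concept,
                Value=value,
            )

    def parse_raw_excel(self):
        # Placeholder for the lexedata-based Excel parsing; yields nothing here.
        return iter(())
```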

Or am I missing something?