Closed Anaphory closed 4 years ago
I agree with this naming scheme being slightly inconsistent. At least whenever a component is referenced using an explicit CLDF *Reference
property, this should result in the component name being used in the table name rather than the table's url
.
I haven't found the .
in table names to be much of a problem, though. Just escape the table name with "*.*"
, e.g. here: https://github.com/pofatu/pofatu-data#querying-the-sqlite-data
I have a fix for this bug, but I'm not sure whether this would constitute a breaking change.
@chrzyki @SimonGreenhill @LinguList @Anaphory : So the fix will result in a (potentially) different schema for SQLite dbs created from CLDF dataset. If the previous schema can be considered buggy, it would just be a bugfix, people would need to re-generate their databases, but hopefully they would not have written queries against the buggy parts of the schema. What do you think? (I wouldn't want to implement a backwards compat fallback, because that part of the code is already complex enough ...)
Break it now. I think the odds of someone using the old schema here are pretty low.
@SimonGreenhill you mean breaking without bumping the major version, right?
yeah, I don't think we need a new major version here.
Agreeing with @SimonGreenhill.
@Anaphory just released 1.13.0 with a fix for this. Let me know whether this works for you!
I just got around to testing it.
I have an SQLite database containing the following table.
CREATE TABLE IF NOT EXISTS "FormTable_ParameterTable" (
"FormTable_cldf_id" INTEGER NOT NULL,
"ParameterTable_cldf_id" INTEGER NOT NULL,
context VARCHAR,
"Internal_Comment" VARCHAR,
PRIMARY KEY ("FormTable_cldf_id", "ParameterTable_cldf_id"),
FOREIGN KEY("FormTable_cldf_id") REFERENCES "FormTable" (cldf_id),
FOREIGN KEY("ParameterTable_cldf_id") REFERENCES "ParameterTable" (cldf_id)
);
Running cldf dumpdb cldf-metadata.json lexical_data.sqlite
(The metadata json file is essentially the one from my first post) does not fill the Concept_ID
column in forms.csv
.
I have done some step-by-step debugging. I originally thought that the issue might be in
https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L206
All code around seems to be purely inside csvw
, which I thought did not know anything about the translation. But actually, the translation is passed to the csvw.db.Database
constructor, and that translation object misses the FormTable_ParameterTable
:
> /home/gereon/Develop/csvw/src/csvw/db.py(299)__init__()
-> self.translate = translate or Database.name_translator
(Pdb) translate
functools.partial(<function translate at 0x7f5fff6104c0>, {'forms.csv': TableTranslation(name='FormTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'concepts.csv': TableTranslation(name='ParameterTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'languages.csv': TableTranslation(name='LanguageTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'forms.csv_SourceTable': TableTranslation(name='FormTable_SourceTable', columns={'forms.csv_ID': 'FormTable_cldf_id'})})
So the issue is actually that the chunk assembling the translation function
https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L125-L143
operates only on tables mentioned in the metadata, and the later chunk that adds some many-to-many-relation tables
https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L168-L187
only checks sources, not other list-shaped foreign keys that would correspond to many-to-many relations and thus to association tables.
I have now added that to the translation (by the way, https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L83 should be a factory and not a default value), I'm still not sure where the culprit is. Should
https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L206
mess with the table
it gets passed and apply the translation to its name? Or should csvw.db.Database.read()
already be aware of that translation and its table.many_to_many
already be transformed?
@Anaphory I think the behaviour is as expected. The "fake", list-valued foreign key FormTable.Concept_IDs
is replaced with the "real" foreign key into ParameterTable
from the association table. What would you expect ConceptIDs
to be populated with - the concatenated string of all concept IDs? But interpreting this column then would require knowing the metadata to interpret it (to infer the separator), whereas interpreting the association table can be done purely by inspecting the SQL schema.
The translate
function does operate on names of objects in the association table. See the tests, https://github.com/cldf/pycldf/blob/dcd5acd68a3ace91e9d8cb121b2eaf7132c1f78c/tests/test_db.py#L110-L146
[I see. Now i need to think.]
I did indeed expect that ConceptIDs would be populated with the concatenated string of all concept IDs. Aren't the metadata always supplied and needed anyway? So what is the expense in using it to interpret many-to-many relations?
The expense is potentially sabotaging round-tripping. If you export the DB back to CLDF, you'd have to make sure, Concept_IDs
and the set of foreign keys from the association table match. And what do you do if they don't?
I thought round-tripping CLDF-SQLite-CLDF would also be impossible in this, but apparently I misunderstood you and the code and miscommunicated my issues. I created a minimal example and saw that I get the same CLDF back from the SQLite created from CLDF, so as so often the problem is on my end. I must be messing up the creation of my database somewhere because the code obviously works as intended, and if I conform to its expectations, I should get a clean CLDF out.
So, the idea of the SQLite conversion is primarily to
With functioning round-tripping, the SQLite db can also be manipulated to some extent (with caveats, because not all constraints described in the metadata are actually enforced by SQLite) - at least adding rows should work.
Starting out with an existing database schema, and trying to craft CLDF metadata which will allow populating the database, was not one of the usage scenarios - and I can think of various places where such an attempt will fail.
I tried to generate a dataset in SQLite format and import it into CLDF using
pycldf.db.Database.to_cldf
. Mostly, I could wrap my head around the requirements on that database structure, with its useful translations to CLDF ontology terms. However, I found a little-use spot where that translation is not happening, and that was confusing, and I don't know enough aboutcsvw
to start suggesting a fix.My metadata file is given at the bottom of this post. The interesting feature is the
"separator": ";",
in FormTable.parameterReference. This is nicely parsed as a many-to-many connection through an association table. However, where all other tables and columns are expected to be named according to CLDF terms, the first column of this table is expected to beforms.csv_concepts.csv.ID
, notFormTable_ParameterTable.cldf_id
, which is not only inconsistent with howpycldf.db.Database
works otherwise, but also problematic because the.
in the URL is passed to the table name generator without question. I currently work around this issue by explicitly settingbut I think the clean way would be if
pycldf.db
treated these properties the same way it treats every other potential CLDF Term.