cldf / pycldf

python package to read and write CLDF datasets
https://cldf.clld.org
Apache License 2.0
15 stars 7 forks source link

db import translates SourceTable many-to-many relationships, but not those of other tables #105

Closed Anaphory closed 4 years ago

Anaphory commented 4 years ago

I tried to generate a dataset in SQLite format and import it into CLDF using pycldf.db.Database.to_cldf. Mostly, I could wrap my head around the requirements on that database structure, with its useful translations to CLDF ontology terms. However, I found a little-use spot where that translation is not happening, and that was confusing, and I don't know enough about csvw to start suggesting a fix.

My metadata file is given at the bottom of this post. The interesting feature is the "separator": ";", in FormTable.parameterReference. This is nicely parsed as a many-to-many connection through an association table. However, where all other tables and columns are expected to be named according to CLDF terms, the first column of this table is expected to be forms.csv_concepts.csv.ID, not FormTable_ParameterTable.cldf_id, which is not only inconsistent with how pycldf.db.Database works otherwise, but also problematic because the . in the URL is passed to the table name generator without question. I currently work around this issue by explicitly setting

db.tables[2].many_to_many["Concept_IDs"].name = "FormTable_ParameterTable"

but I think the clean way would be if pycldf.db treated these properties the same way it treats every other potential CLDF Term.

{
    "@context": [
        "http://www.w3.org/ns/csvw",
        {
            "@language": "en"
        }
    ],
    "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
    "dialect": {
        "commentPrefix": null
    },
    "tables": [
        {
            "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable",
            "dc:extent": 32465,
            "tableSchema": {
                "columns": [
                    {
                        "datatype": {
                            "base": "string",
                            "format": "[a-zA-Z0-9_\\-]+"
                        },
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
                        "required": true,
                        "name": "ID"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
                        "required": false,
                        "name": "Language_ID"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference",
                        "required": false,
                        "separator": ";",
                        "name": "Concept_IDs"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Phonemic_Transcription"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#form",
                        "required": false,
                        "name": "Phonetic_Transcription"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Orthographic_Transcription"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Variants_of_Form_given_by_Source"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value",
                        "required": true,
                        "name": "Original_Value_of_Cell"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#comment",
                        "required": false,
                        "name": "Comment"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#source",
                        "required": false,
                        "separator": ";",
                        "name": "Source"
                    }
                ],
                "foreignKeys": [
                    {
                        "columnReference": [
                            "Language_ID"
                        ],
                        "reference": {
                            "resource": "languages.csv",
                            "columnReference": [
                                "ID"
                            ]
                        }
                    },
                    {
                        "columnReference": [
                            "Concept_IDs"
                        ],
                        "reference": {
                            "resource": "concepts.csv",
                            "columnReference": [
                                "ID"
                            ]
                        }
                    }
                ],
                "primaryKey": [
                    "ID"
                ]
            },
            "url": "forms.csv"
        },
        {
            "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ParameterTable",
            "tableSchema": {
                "columns": [
                    {
                        "datatype": {
                            "base": "string",
                            "format": "[a-zA-Z0-9_\\-]+"
                        },
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
                        "required": true,
                        "name": "ID"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#name",
                        "required": true,
                        "name": "English"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Set"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "English_Strict"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Spanish"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "French"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Portuguese"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#comment",
                        "required": false,
                        "name": "Comment"
                    }
                ],
                "primaryKey": [
                    "ID"
                ]
            },
            "url": "concepts.csv"
        },
        {
            "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#LanguageTable",
            "dc:extent": 38,
            "tableSchema": {
                "columns": [
                    {
                        "datatype": {
                            "base": "string",
                            "format": "[a-zA-Z0-9_\\-]+"
                        },
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
                        "required": true,
                        "name": "ID"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#name",
                        "required": false,
                        "name": "Name"
                    },
                    {
                        "datatype": {
                            "base": "string",
                            "format": "[a-z0-9]{4}[1-9][0-9]{3}"
                        },
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#glottocode",
                        "required": false,
                        "valueUrl": "http://glottolog.org/resource/languoid/id/{glottocode}",
                        "name": "Glottocode"
                    },
                    {
                        "datatype": {
                            "base": "string",
                            "format": "[a-z]{3}"
                        },
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#iso639P3code",
                        "required": false,
                        "name": "ISO639P3code"
                    },
                    {
                        "datatype": "string",
                        "required": false,
                        "name": "Curator"
                    },
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#comment",
                        "required": false,
                        "name": "Comment"
                    }
                ],
                "primaryKey": [
                    "ID"
                ]
            },
            "url": "languages.csv"
        }
    ]
}
xrotwang commented 4 years ago

I agree with this naming scheme being slightly inconsistent. At least whenever a component is referenced using an explicit CLDF *Reference property, this should result in the component name being used in the table name rather than the table's url.

I haven't found the . in table names to be much of a problem, though. Just escape the table name with "*.*", e.g. here: https://github.com/pofatu/pofatu-data#querying-the-sqlite-data

xrotwang commented 4 years ago

I have a fix for this bug, but I'm not sure whether this would constitute a breaking change.

@chrzyki @SimonGreenhill @LinguList @Anaphory : So the fix will result in a (potentially) different schema for SQLite dbs created from CLDF dataset. If the previous schema can be considered buggy, it would just be a bugfix, people would need to re-generate their databases, but hopefully they would not have written queries against the buggy parts of the schema. What do you think? (I wouldn't want to implement a backwards compat fallback, because that part of the code is already complex enough ...)

SimonGreenhill commented 4 years ago

Break it now. I think the odds of someone using the old schema here are pretty low.

xrotwang commented 4 years ago

@SimonGreenhill you mean breaking without bumping the major version, right?

SimonGreenhill commented 4 years ago

yeah, I don't think we need a new major version here.

chrzyki commented 4 years ago

Agreeing with @SimonGreenhill.

xrotwang commented 4 years ago

@Anaphory just released 1.13.0 with a fix for this. Let me know whether this works for you!

Anaphory commented 4 years ago

I just got around to testing it.

I have an SQLite database containing the following table.

CREATE TABLE IF NOT EXISTS "FormTable_ParameterTable" (
    "FormTable_cldf_id" INTEGER NOT NULL, 
    "ParameterTable_cldf_id" INTEGER NOT NULL, 
    context VARCHAR, 
    "Internal_Comment" VARCHAR, 
    PRIMARY KEY ("FormTable_cldf_id", "ParameterTable_cldf_id"), 
    FOREIGN KEY("FormTable_cldf_id") REFERENCES "FormTable" (cldf_id), 
    FOREIGN KEY("ParameterTable_cldf_id") REFERENCES "ParameterTable" (cldf_id)
);

Running cldf dumpdb cldf-metadata.json lexical_data.sqlite (The metadata json file is essentially the one from my first post) does not fill the Concept_ID column in forms.csv.

I have done some step-by-step debugging. I originally thought that the issue might be in

https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L206

All code around seems to be purely inside csvw, which I thought did not know anything about the translation. But actually, the translation is passed to the csvw.db.Database constructor, and that translation object misses the FormTable_ParameterTable:

> /home/gereon/Develop/csvw/src/csvw/db.py(299)__init__()
-> self.translate = translate or Database.name_translator
(Pdb) translate
functools.partial(<function translate at 0x7f5fff6104c0>, {'forms.csv': TableTranslation(name='FormTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'concepts.csv': TableTranslation(name='ParameterTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'languages.csv': TableTranslation(name='LanguageTable', columns={'ID': 'cldf_id', 'Language_ID': 'cldf_languageReference', 'Concept_IDs': 'cldf_parameterReference', 'Phonetic_Transcription': 'cldf_form', 'Original_Value_of_Cell': 'cldf_value', 'Comment': 'cldf_comment', 'Source': 'cldf_source', 'English': 'cldf_name', 'Name': 'cldf_name', 'Glottocode': 'cldf_glottocode', 'ISO639P3code': 'cldf_iso639P3code'}), 'forms.csv_SourceTable': TableTranslation(name='FormTable_SourceTable', columns={'forms.csv_ID': 'FormTable_cldf_id'})})
Anaphory commented 4 years ago

So the issue is actually that the chunk assembling the translation function

https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L125-L143

operates only on tables mentioned in the metadata, and the later chunk that adds some many-to-many-relation tables

https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L168-L187

only checks sources, not other list-shaped foreign keys that would correspond to many-to-many relations and thus to association tables.

Anaphory commented 4 years ago

I have now added that to the translation (by the way, https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L83 should be a factory and not a default value), I'm still not sure where the culprit is. Should

https://github.com/cldf/pycldf/blob/master/src/pycldf/db.py#L206

mess with the table it gets passed and apply the translation to its name? Or should csvw.db.Database.read() already be aware of that translation and its table.many_to_many already be transformed?

xrotwang commented 4 years ago

@Anaphory I think the behaviour is as expected. The "fake", list-valued foreign key FormTable.Concept_IDs is replaced with the "real" foreign key into ParameterTable from the association table. What would you expect ConceptIDs to be populated with - the concatenated string of all concept IDs? But interpreting this column then would require knowing the metadata to interpret it (to infer the separator), whereas interpreting the association table can be done purely by inspecting the SQL schema.

The translate function does operate on names of objects in the association table. See the tests, https://github.com/cldf/pycldf/blob/dcd5acd68a3ace91e9d8cb121b2eaf7132c1f78c/tests/test_db.py#L110-L146

Anaphory commented 4 years ago

[I see. Now i need to think.]

I did indeed expect that ConceptIDs would be populated with the concatenated string of all concept IDs. Aren't the metadata always supplied and needed anyway? So what is the expense in using it to interpret many-to-many relations?

xrotwang commented 4 years ago

The expense is potentially sabotaging round-tripping. If you export the DB back to CLDF, you'd have to make sure, Concept_IDs and the set of foreign keys from the association table match. And what do you do if they don't?

Anaphory commented 4 years ago

I thought round-tripping CLDF-SQLite-CLDF would also be impossible in this, but apparently I misunderstood you and the code and miscommunicated my issues. I created a minimal example and saw that I get the same CLDF back from the SQLite created from CLDF, so as so often the problem is on my end. I must be messing up the creation of my database somewhere because the code obviously works as intended, and if I conform to its expectations, I should get a clean CLDF out.

xrotwang commented 4 years ago

So, the idea of the SQLite conversion is primarily to

With functioning round-tripping, the SQLite db can also be manipulated to some extent (with caveats, because not all constraints described in the metadata are actually enforced by SQLite) - at least adding rows should work.

Starting out with an existing database schema, and trying to craft CLDF metadata which will allow populating the database, was not one of the usage scenarios - and I can think of various places where such an attempt will fail.