autotyp / autotyp-data (AUTOTYP data export)
License: Creative Commons Attribution 4.0 International

Provide data in CLDF format #2

Status: Open. Opened by xrotwang 7 years ago.

xrotwang commented 7 years ago

It may be worthwhile to change the data format in this repo to CLDF. As far as I can tell, not too many changes would be required to do so:

What you would gain:

tzakharko commented 7 years ago

For the initial export, we chose YAML because of its human-readability. This was done to make the data more accessible to a wider audience. We aim to provide supplementary JSON-LD metadata in future releases that expose more of the database internal structure.

In the meantime, if there is interest, it should be possible to create a pipeline that converts the current CSV/YAML format to CLDF CSV/JSON-LD automatically. We will gladly accept community help in setting this up.
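To make the idea concrete, here is a rough, purely illustrative sketch of one such conversion step (assuming one YAML metadata file and one CSV data file per module; the file names, the LID column, and the YAML layout are assumptions, not the actual formats):

import csv

import yaml
from pycldf import StructureDataset

# Hypothetical input: variable descriptions in YAML, data in CSV.
meta = yaml.safe_load(open('Gender.yaml'))
rows = list(csv.DictReader(open('Gender.csv')))

ds = StructureDataset.in_dir('cldf')
ds.write(
    # One CLDF parameter per described variable ...
    ParameterTable=[dict(ID=name, Name=name) for name in meta],
    # ... and one CLDF value per language/variable pair.
    ValueTable=[
        dict(ID='{}-{}'.format(row['LID'], name),
             Language_ID=row['LID'], Parameter_ID=name, Value=row[name])
        for row in rows for name in meta if row.get(name) is not None])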

@all: Please comment in this thread if you are interested in creating such a pipeline; we could use it to draft a roadmap.

xrotwang commented 7 years ago

I will have a look into this. Could be a good example for the refactored CLDF structure dataset spec.

xflr6 commented 6 years ago

First stab at the conversion is here: https://github.com/clld/autotyp-data/blob/cldf/autotyp_to_cldf.py

To work around #9 and #10, the ill-formed data was removed, cf. the commits in the issues branch.

Result as a ZIP file: autotyp-cldf.zip

tzakharko commented 2 years ago

First of all, apologies that it took a while. Our previous database pipeline was unmaintainable, so we had to redesign and rebuild it from scratch. With the new pipeline we are better equipped to track dependencies between datasets (not explicitly part of the metadata yet, but they will be soonish), so it is a good time to revisit this issue and chart a way to provide a robust solution.

One potential difficulty I see is that we decided to go with nested/repeated data for some datasets, as it simplifies handling and conceptualisation in practice. What would be a good way of mapping this kind of data model to CLDF? If I understand correctly, there is some support for repeated simple values, but what about nested records?

xrotwang commented 2 years ago

A relatively straightforward way to handle this is using JSON serialized as string as values, and adding enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

from csvw.metadata import Datatype
from pycldf import StructureDataset

# Create a StructureDataset in the directory 'ds' and extend its
# ParameterTable with a "datatype" column holding a csvw datatype description.
ds = StructureDataset.in_dir('ds')
ds.add_component('ParameterTable', {'name': 'datatype', 'datatype': 'json'})
# The parameter declares its values to be JSON; the value itself is a JSON
# object serialized as string.
ds.write(
    ParameterTable=[dict(ID='pid', datatype='json')],
    ValueTable=[dict(ID='1', Language_ID='l', Parameter_ID='pid', Value='{"a": 2}')])

# Reading: instantiate a Datatype from the spec stored with the parameter ...
dt = Datatype.fromvalue(ds.get_object('ParameterTable', 'pid').data['datatype'])
# ... and use it to parse the values back into Python objects.
for v in ds['ValueTable']:
    v = dt.parse(v['Value'])
    assert isinstance(v, dict)
    print(v['a'])

Here, we add a column datatype to ParameterTable and mark it as a JSON column (which is understood by csvw). When reading data from ValueTable, we first instantiate a csvw.metadata.Datatype from the datatype spec in ParameterTable, and then use this object to parse the value accordingly.
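For illustration, the values.csv written by the example above would contain the JSON object as a quoted string, roughly like this (assuming the default ValueTable column set of a StructureDataset):

ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source
1,l,pid,"{""a"": 2}",,,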

xrotwang commented 2 years ago

Btw. I'm in the process of putting together a conversion from the AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

tzakharko commented 2 years ago

> A relatively straightforward way to handle this is using JSON serialized as string as values, and adding enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

That's neat! But at this point, what is the value of using CSV at all? Why not just go all JSON?

> Btw. I'm in the process of putting together a conversion from the AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

Keep them coming :) One problem is that the published YAML metadata is just a subset of the much richer metadata we maintain internally for the export pipeline, and the mapping is not perfect. There are many improvements planned here, e.g. relationships between fields, more precise types and constraints, etc.; these unfortunately didn't make it into the big release.

xrotwang commented 2 years ago

As soon as a particular data type for values becomes more widespread - including standard analysis methods - it becomes a candidate for "more" standardisation in CLDF. Putting it into CLDF now basically puts it "on track" for this. Also, CSV - even if it includes smallish JSON snippets - plays nicer with version control, because it mostly avoids the "unspecified whitespace" and "attribute order" issues of JSON or XML.

I should add that CLDF comes with "built-in" validation: e.g. invalid values for categorical data, non-existent Glottocodes, etc. will be flagged out of the box. And generating human-readable metadata descriptions is easy, e.g. with cldfbench (see e.g. https://github.com/glottolog/glottolog-cldf/blob/master/cldf/README.md). So arguably, making CLDF the target release format for AUTOTYP might solve some of the issues here.
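For example, with pycldf installed, validation is one command:

cldf validate StructureDataset-metadata.json

It reports schema violations and referential-integrity problems, and prints nothing if the dataset is valid.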

tzakharko commented 2 years ago

@xrotwang could you share your CLDF conversion pipeline with me? I would like to add it to the build system, so that we have CLDF as a first-class target.

xrotwang commented 2 years ago

It's here: https://github.com/cldf-datasets/autotypcldf

Using https://github.com/cldf/cldfbench, autotyp-data is pulled in as a git submodule, see https://github.com/cldf-datasets/autotypcldf/tree/main/raw, and the conversion is run via

cldfbench makecldf cldfbench_autotyp.py --glottolog-version v4.5

which basically runs the code in https://github.com/cldf-datasets/autotypcldf/blob/main/cldfbench_autotypcldf.py
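For readers who don't know cldfbench: such a module defines a Dataset subclass whose cmd_makecldf method writes the CLDF data. A bare-bones skeleton of such a module (the body shown here is a placeholder, not the actual AUTOTYP conversion code) looks roughly like:

from pathlib import Path

from cldfbench import CLDFSpec, Dataset as BaseDataset

class Dataset(BaseDataset):
    id = 'autotypcldf'
    dir = Path(__file__).parent

    def cldf_specs(self):
        # Write a CLDF StructureDataset to the cldf/ subdirectory.
        return CLDFSpec(dir=self.cldf_dir, module='StructureDataset')

    def cmd_makecldf(self, args):
        # Read the raw data (here: the autotyp-data git submodule in raw/)
        # and emit CLDF rows through the writer supplied by cldfbench.
        args.writer.objects['ValueTable'].append(
            dict(ID='1', Language_ID='l', Parameter_ID='p', Value='example'))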

tzakharko commented 2 years ago

The CLDF dataset is now available in the cldf-export branch.

The Python dataset classes are here. I have copied your code verbatim, just adjusted the file paths and removed the bibliography fix, since it is no longer necessary.

Could you have a look at whether the CLDF data is OK like this? If there are no concerns, I can draft a 1.1.0 release.

xrotwang commented 2 years ago

Looks ok:

$ cldf stats StructureDataset-metadata.json 
<cldf:v1.0:StructureDataset at .>
                     value
-------------------  -----
dc:conformsTo        http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
dc:source            sources.bib
prov:wasDerivedFrom  [{'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}, {'rdf:about': 'https://github.com/glottolog/glottolog', 'rdf:type': 'prov:Entity', 'dc:created': 'v4.5', 'dc:title': 'Glottolog'}, {'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}]
prov:wasGeneratedBy  [{'dc:title': 'python', 'dc:description': '3.9.10'}, {'dc:title': 'python-packages', 'dc:relation': 'requirements.txt'}]
rdf:ID               autotyp
rdf:type             http://www.w3.org/ns/dcat#Distribution

                   Type                 Rows
-----------------  -----------------  ------
values.csv         ValueTable         278536
languages.csv      LanguageTable        3053
contributions.csv  ContributionTable      46
parameters.csv     ParameterTable       1013
codes.csv          CodeTable            1402
sources.bib        Sources              5001

and creating a SQLite db from it works as well.
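For reference, the SQLite conversion is also a single pycldf CLI call (the output file name is arbitrary):

cldf createdb StructureDataset-metadata.json autotyp.sqlite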

So, looks good to me.

nataliacp commented 1 year ago

I have posted a comment on closed issue #51 (which I can't reopen), so I am copying it here, as it is relevant for the conversion to the CLDF format. It is about the synthesis module, but it could be applicable to other complex modules too.

I have a proposal to increase data reusability in CLDF. Right now, the variables listed in the first comment of that issue are stored as JSON under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables, and they could be incorporated straightforwardly into the CLDF format. The only problem is that values for these variables can be trusted only for languages where both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete) are TRUE. What do you think @tzakharko and @xrotwang?
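A hedged sketch of what this unpacking could look like; whether the housekeeping flags live inside the JSON value is an assumption, and the shape of the records is illustrative:

import json

# Illustrative sketch only: unpack the MaximallyInflectedVerbSynthesis JSON
# values into one simple binary parameter per field, keeping only languages
# where both housekeeping surveys are complete.
def unpack_synthesis(value_rows):
    for row in value_rows:  # ValueTable rows of the umbrella parameter
        data = json.loads(row['Value'])
        if not (data.get('IsVerbAgreementSurveyComplete')
                and data.get('IsVerbInflectionSurveyComplete')):
            continue  # skip languages with incomplete surveys
        for field, val in data.items():
            if isinstance(val, bool):  # only the simple binary variables
                yield dict(
                    ID='{}-{}'.format(row['Language_ID'], field),
                    Language_ID=row['Language_ID'],
                    Parameter_ID=field,
                    Value=val)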