Support for multilingual RDF

amercader commented 8 years ago

Right now, neither the parsers nor the serializers take multilingual metadata into account.

For instance, given the following document, one of the two titles will be picked arbitrarily at parse time:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix adms:    <http://www.w3.org/ns/adms#> .

<http://data.london.gov.uk/dataset/Abandoned_Vehicles>
      a       dcat:Dataset ;
      dct:title "Abandoned Vehicles"@en ;
      dct:title "Vehículos Abandonados"@es ;
      adms:versionNotes "Some version notes"@en ;
      adms:versionNotes "Notas de la versión"@es ;

      ...

Parsing

The standard way of dealing with this seems to be to create metadata during parsing that ckanext-fluent can handle when creating or updating the datasets. This essentially means storing a dict instead of a string, with the language codes as keys:

{
    "version_notes": {
        "en": "Some version notes",
        "es": "Notas de la versión"
    }
    ...
}

For core fields like title or notes, we need to add an extra field suffixed with _translated:

    "title": "",
    "title_translated": {
        "en": "Abandoned Vehicles",
        "es": "Vehículos Abandonados"
    }
    ...

TODO: what to put in title?

To support this we can probably add a variant of _object_value that handles the language tags and returns a dict accordingly (RDFLib returns a separate triple for each language).
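A minimal sketch of such a variant, not the extension's actual code: the function name `object_value_multilang`, the `default_lang` fallback for untagged literals, and the `TaggedLiteral` stand-in are all hypothetical. The stand-in mimics the one RDFLib feature relied on here (a `Literal` is a string with a `language` attribute), so the example runs without RDFLib installed.

```python
class TaggedLiteral(str):
    """Hypothetical stand-in for rdflib.term.Literal: a str with a `language`."""

    def __new__(cls, text, language=None):
        obj = super().__new__(cls, text)
        obj.language = language
        return obj


def object_value_multilang(literals, default_lang="en"):
    """Collapse language-tagged literals into a {lang: text} dict.

    `literals` is any iterable of literal-like objects, for example what
    graph.objects(subject, predicate) yields in RDFLib.
    Assumptions: untagged literals are filed under `default_lang`, and the
    first value seen for a given language wins.
    """
    translations = {}
    for literal in literals:
        lang = literal.language or default_lang
        translations.setdefault(lang, str(literal))
    return translations


titles = [
    TaggedLiteral("Abandoned Vehicles", "en"),
    TaggedLiteral("Vehículos Abandonados", "es"),
]
title_translated = object_value_multilang(titles)
```

Here `title_translated` has the dict shape described above, keyed by language code.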

Serializing

Similarly, the serializing code could check whether each field marked as multilingual holds a string or a dict and create triples accordingly, probably via a helper function.
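One possible shape for that helper, sketched under assumptions: the name `literal_pairs` is hypothetical, empty values are skipped, and languages are emitted in sorted order only to keep the output deterministic. It returns (text, lang) pairs rather than building triples, so a serializer could wrap each pair with RDFLib's `Literal(text, lang=lang)` before adding it to the graph.

```python
def literal_pairs(value, default_lang=None):
    """Yield (text, lang) pairs for a field that may be a str or a dict.

    A dict is treated as {lang: text}, the shape ckanext-fluent stores;
    a plain string yields a single pair tagged with `default_lang`
    (None meaning "no language tag").
    """
    if isinstance(value, dict):
        for lang, text in sorted(value.items()):
            if text:  # skip empty translations
                yield text, lang
    elif value:
        yield value, default_lang


multilingual = {"es": "Notas de la versión", "en": "Some version notes"}
pairs = list(literal_pairs(multilingual))
plain = list(literal_pairs("Some version notes"))
```

With this split, the same call site handles both monolingual and multilingual datasets, and the field value alone decides how many triples are created.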

Things to think about:

Should this be the default or enabled via config option?
This will probably require using ckanext-scheming as well, otherwise multilingual fields won't be properly stored (#56).

amercader commented 8 years ago

@wardi does that sound right? Also see the TODO above, does it matter what we put in there?

metaodi commented 8 years ago

@amercader We are starting to implement this for DCAT-AP Switzerland, I'll keep you posted. We currently use the ckanext-fluent approach.

amercader commented 8 years ago

Fantastic @metaodi! Let me know if you want me to help with the spec or the discussion.

metaodi commented 8 years ago

Btw: here is the implementation of our multilingual DCAT-AP Switzerland profile: https://github.com/ogdch/ckanext-switzerland/blob/01652937c8f31f46d8560ab9527826a3c1523c06/ckanext/switzerland/dcat/profiles.py

Behind the scenes we use ckanext-scheming for validation/schema.

The main change to the "original" is the new parameter multilang in the _object_value method. We simply use this for all values where we expect multilingual values.

amercader commented 6 months ago

Note there are two ongoing PRs with initial implementations:

RDF -> CKAN (Parsing): https://github.com/ckan/ckanext-dcat/pull/124
CKAN -> RDF (Serializing): https://github.com/ckan/ckanext-dcat/pull/240

ckan / ckanext-dcat

Support for multilingual RDF #55