Multi lingual dataset support

RickMoynihan commented 6 years ago

RDF supports lang strings, and there's a possibility of multi-lingual datasets.

We may want to add support for this as part of OGI.

zeginis commented 6 years ago

I agree. At OGI there are multi-lingual datasets.

We may consider using JSON-LD (@language) to express the language used.

RickMoynihan commented 6 years ago

I'm no longer sure we can use JSON-LD, but I am curious about the requirements for multiple languages.

For example, would a multilingual client want to list all labels in all languages? Or should it only ever get back a single requested (or default) language?

For example, you could imagine changing the language at the outermost field for the whole subtree:

{  
   datasets(language:"fr") { 
      title
      dimensions { 
         values {
           label 
         }
      }
   }
}

Obviously we could also let you query for what languages are currently in the system, e.g.

{
   languages { 
       country_code
   }
}

Other alternatives are to expand every string field into two sub-fields, lang and value, which seems pretty heavy-handed, or to generate fields in the schema for every language in the system, e.g. title_fr, title_en, title_gb.

zeginis commented 6 years ago

I think a single requested or default language is enough. So something like datasets(language: "fr") {..} is OK.

It is preferable to get the available languages for a specific dataset, not for the whole system, because different datasets may have different available languages, e.g.

{
   languages(dataset: "http://statistics.gov.scot/data/earnings") { 
       country_code
   }
}

RickMoynihan commented 6 years ago

> It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages.

👍

zeginis commented 6 years ago

Some issues related to the language:

Greek labels were not supported
The language tag (e.g. @en) causes errors

RickMoynihan commented 6 years ago

Specifically the current problem with language strings is that they cause exceptions during schema generation by failing the following spec (from issue #53):

In: [0 :objects :dataset_vehicles_cube 1 :description] val: #grafter.rdf.protocols.LangString{:string "Vehicles Cube", :lang :en} fails spec: :com.walmartlabs.lacinia.schema/description at: [:args :schema :objects 1 :description] predicate: string?

@zeginis I think it would be desirable to keep the graphql schema simple here and avoid having to represent multiple languages in the schema at this stage, i.e. we should avoid doing things like this for every label/title:

{ 
  title {
     title  # the real title string
     language
  }    
}

i.e. I think I'd rather keep the schema for labels flat like this:

{
   title
}

This will probably mean in the cases of multiple languages setting a default to use everywhere throughout the API; we could potentially allow toggling the default at the top of the query.

@zeginis Does that sound like an acceptable compromise? The limitation is that within a single request you won't be able to see things like the title of a dataset in both English and Greek.

zeginis commented 6 years ago

It is OK to define the language at the top of the query and thus get results in only one language.

RickMoynihan commented 6 years ago

One other question @zeginis, would it be acceptable to not let you set this at the top of the query; but to supply it as a configuration option to the server itself? i.e. no schema representation at all?

zeginis commented 6 years ago

> One other question @zeginis, would it be acceptable to not let you set this at the top of the query; but to supply it as a configuration option to the server itself? i.e. no schema representation at all?

@RickMoynihan this solution is not applicable at OGI, since we will have cubes from many pilots on the same server, with labels in different languages, e.g. Greek and English.

So it is preferable to define the language at the top of the query. Any idea how to do this?

RickMoynihan commented 6 years ago

> Any idea how to do this?

It's not currently supported; if you're asking about how I think it should be implemented, then I'd suggest:

We should introduce a new root cubiql node to support various parameters such as this for all subtree schemas. The idea being that parameters set at the root affect those parts of the query within its lexical scope:

i.e. we would probably have to change the query structure to something like the following, so that lang_preference affects not just datasets but also specific dataset schemas, and any others we add too:

{
   cubiql(lang_preference: "gr") {
      datasets {
          title
          description
      }
   }
}

The lang_preference attribute should specify a language tag preference, not a hard constraint, i.e. if you have a dataset with a dcterms:title of "School"^^xsd:string and "σχολείο"@gr, it should select the Greek title. If, however, for description it only has :school-ds dcterms:description "Numbers of schools by area"^^xsd:string, then it should fall back and return the xsd:string.

In terms of implementation I don't think there is a good way to express this priority on labels in SPARQL in a performant and simple enough way. So I think the best way to implement this is to make sure we implement all these queries as CONSTRUCTs, and then implement the priority filtering on all returned data. The algorithm roughly would be to group the local graph of ?s ?p ?o by ?s ?p, then for each ?s ?p where ?o is DATATYPE xsd:string || rdf:langString return only ?o where the ?o matches lang_preference or failing that return an xsd:string and failing that return any other rdf:langString.
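To make the grouping and fallback order concrete, here is a minimal sketch of that post-processing step in Python. It is only an illustration under stated assumptions, not cubiql's implementation: the `Literal` type, `rank` and `select_by_preference` are hypothetical names, the input is assumed to contain only string-valued objects already fetched by a CONSTRUCT, and it assumes exactly one value is kept per ?s ?p pair.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical stand-in for a string literal from the CONSTRUCT result."""
    value: str
    lang: Optional[str] = None  # None models a plain xsd:string


def rank(literal, lang_preference):
    """Lower is better: preferred language, then xsd:string, then any other language."""
    if literal.lang == lang_preference:
        return 0
    if literal.lang is None:
        return 1
    return 2


def select_by_preference(triples, lang_preference):
    """Group (s, p, o) triples by (s, p) and keep one best-ranked object per group."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[(s, p)].append(o)
    # min() returns the first of equally ranked candidates, so ties keep input order.
    return {
        key: min(objects, key=lambda o: rank(o, lang_preference))
        for key, objects in grouped.items()
    }


triples = [
    (":school-ds", "dcterms:title", Literal("School")),
    (":school-ds", "dcterms:title", Literal("σχολείο", "gr")),
    (":school-ds", "dcterms:description", Literal("Numbers of schools by area")),
]
selected = select_by_preference(triples, "gr")
```

With the example data from above, the title resolves to the "gr" literal, while the description falls back to the xsd:string because no "gr" value exists.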

Is something like this what you were thinking of implementing?

zeginis commented 6 years ago

Yes, this is what I was thinking of implementing. I realize that it is not as simple as I expected.

Do you think there is a way to temporarily overcome the exceptions (#88) caused by the language tags even if we do not fully support filtering by language?

RickMoynihan commented 6 years ago

That's a good question @zeginis. I suspect it's a pretty trivial fix to make that specific error go away, as it's probably not much more than calling str on the language tagged string before returning it.

However there's still the expectation that there's only ONE value for a lot of these fields. So this would likely only really work for string properties with a cardinality of 1; as to retain the schema you'll need to pick just one string; and then you're into the territory of the above suggestion.

I could be wrong, but I'm not sure this hacky solution is worth doing, because you either need to implement the prioritisation logic above, or return a random string (unacceptable, as datasets would render with mixed languages), or hack your data so you only ever have one string for these fields (either an rdf:langString or an xsd:string would work, but not both, and not more than one of each, i.e. no multi-lingual datasets). My feeling is that if you have to hack your data to remove the strings you don't want, you might as well have just hacked your data to make them xsd:strings.

The only counter-argument I can see to this (in support of implementing the str hack) is that it does mean cubiql will support a more correct subset of a larger cube, i.e. it's marginally better to allow "σχολείο"@gr in preference to "σχολείο"^^xsd:string, as you're not downgrading information; you're just loading a subset into cubiql's endpoint.

Practically speaking, though, I'm not sure this correctness argument holds much weight, as you'll still need to hack your data to guarantee it works... it's just that the hack is a tiny bit less hacky.
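For illustration, the str workaround discussed above amounts to something like the following Python sketch. The `LangString` class here is a hypothetical analogue of the record shown in the spec failure, and `as_plain_string` is an invented helper name; the real fix would live in cubiql's Clojure code and may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangString:
    """Hypothetical analogue of a language-tagged string (text plus tag)."""
    string: str
    lang: str

    def __str__(self):
        # Drop the language tag and expose only the text.
        return self.string


def as_plain_string(value):
    """Coerce a language-tagged string to plain text; pass other values through."""
    return str(value) if isinstance(value, LangString) else value


description = as_plain_string(LangString("Vehicles Cube", "en"))
```

This only removes the type mismatch for a single value. It does not choose between several candidate strings, which is why the cardinality caveat above still applies.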

zeginis commented 6 years ago

@RickMoynihan any update on this?

Are you going to fix this, or should we go ahead with the "quick fix" option, i.e. call str on the language-tagged string before returning it?

Swirrl / cubiql

Multi lingual dataset support #6