fluree / db

Fluree database library
https://fluree.github.io/db/

Add support for language tags #490

Closed mpoffald closed 4 days ago

mpoffald commented 1 year ago

See: https://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-language-identifier, https://www.w3.org/TR/sparql11-query/#matchLangTags

dpetran commented 1 year ago

Notes: If language is present, we need to store it in meta. It is a similar problem to supporting identical values in @list.
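
To illustrate the note above, here is a minimal Python sketch showing why the language has to be stored alongside the value. It is not Fluree's actual flake representation: the `Flake` tuple, its field names, the `lang` key and the `articles/4` subject are all hypothetical. When two titles share the same string but differ in language, identity based only on subject, predicate and object collapses them, which is presumably the same issue repeated values in an @list run into.

```python
from typing import NamedTuple, Optional

class Flake(NamedTuple):
    # Hypothetical, simplified stand-in for a stored assertion.
    s: str
    p: str
    o: str
    m: Optional[tuple] = None  # metadata as hashable (key, value) pairs

S = "http://example.org/articles/4"  # made-up subject for illustration
P = "http://purl.org/dc/terms/title"

# The same string value asserted in two languages.
without_meta = {Flake(S, P, "Chat"), Flake(S, P, "Chat")}
with_meta = {
    Flake(S, P, "Chat", (("lang", "en"),)),
    Flake(S, P, "Chat", (("lang", "fr"),)),
}

print(len(without_meta))  # the two assertions collapse into one
print(len(with_meta))     # the language in meta keeps them distinct
```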

bplatz commented 1 year ago

Sample language transaction that should successfully store the language metadata:

[{
  "@context": {"dcterms": "http://purl.org/dc/terms/"},
  "@id": "http://example.org/articles/1",
  "dcterms:title": [
    {
      "@value": "Das Kapital",
      "@language": "de"
    },
    {
      "@value": "Capital",
      "@language": "en"
    }
  ]
},
{
  "@context": {"dcterms": "http://purl.org/dc/terms/"},
  "@id": "http://example.org/articles/2",
  "dcterms:title": "No language title"
},
{
  "@context": {"dcterms": "http://purl.org/dc/terms/"},
  "@id": "http://example.org/articles/3",
  "dcterms:title": [
    {
      "@value": "City",
      "@language": "en"
    },
    {
      "@value": "Ciudad",
      "@language": "es"
    }
  ]
}]

This query should then match only titles tagged de, and therefore return only http://example.org/articles/1:

{"@context": {"dcterms": "http://purl.org/dc/terms/"},
 "select": ["?x", "?title"],
 "where": [["?x", "dcterms:title", "?title"],
           {"filter": ["(= \"de\" (lang ?title))"]}]}
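
The expected result can be sketched in plain Python, independent of any Fluree API. This is only an illustration of the filter semantics under stated assumptions: the helper names `lang`, `literal` and `titles_in` are hypothetical, the prefix is left unexpanded, and `lang` returns an empty string for a value with no language tag, as SPARQL's `lang()` does. The comparison is exact equality, as in the query above, not language-range matching.

```python
TITLE = "dcterms:title"

docs = [
    {"@id": "http://example.org/articles/1",
     TITLE: [{"@value": "Das Kapital", "@language": "de"},
             {"@value": "Capital", "@language": "en"}]},
    {"@id": "http://example.org/articles/2",
     TITLE: "No language title"},
    {"@id": "http://example.org/articles/3",
     TITLE: [{"@value": "City", "@language": "en"},
             {"@value": "Ciudad", "@language": "es"}]},
]

def lang(value):
    """Return the language tag of a value, or "" if it has none."""
    if isinstance(value, dict):
        return value.get("@language", "")
    return ""

def literal(value):
    """Return the plain value, unwrapping a JSON-LD value object if needed."""
    return value["@value"] if isinstance(value, dict) else value

def titles_in(documents, tag):
    """Yield (subject, title) pairs whose title language equals `tag`."""
    for doc in documents:
        values = doc.get(TITLE, [])
        if not isinstance(values, list):
            values = [values]  # a single value may appear without a list
        for v in values:
            if lang(v) == tag:
                yield doc["@id"], literal(v)

result = list(titles_in(docs, "de"))
print(result)  # [('http://example.org/articles/1', 'Das Kapital')]
```

Only articles/1 carries a title tagged de, so articles/2 (no language) and articles/3 (en and es) drop out.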

aaj3f commented 1 year ago

Prioritizing this because language tags are very common in JSON-LD datasets, and a goal for the beta is to support the JSON-LD documents that users bring from existing JSON-LD data repositories.

bplatz commented 1 year ago

There is a choice here: store languages as strings (in the Flake's .-m), or as integers, which are more compact and faster to compare. Without other factors pushing in a certain direction, I'd opt for strings for now. I say this because (a) the language tags are already quite small (e.g. en), (b) I don't think there is a definitive list we should force, so there may be some user-defined values, which would require an independent lookup table, and (c) I don't think it will be used for most data sets.

I do think the 'key' value for language on the .-m map should be as compact as possible. I'd probably opt for :l, but it will be internal, so I'm not sure the actual letter matters.
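
A rough Python sketch of the two options, for comparison only. It does not reflect how Fluree implements the meta map: the dict stands in for the .-m map, the "l" key mirrors the :l suggestion above, and `LangTable` and the `x-custom` tag are hypothetical.

```python
# Option 1: keep the tag itself as a string under a short key.
meta_str = {"l": "en"}

# Option 2: intern tags as integers, which needs a separate lookup table.
class LangTable:
    """Hypothetical two-way mapping between language tags and integer ids."""

    def __init__(self):
        self._ids = {}   # tag -> id
        self._tags = []  # id -> tag

    def intern(self, tag):
        """Return the id for `tag`, assigning the next free id if it is new."""
        if tag not in self._ids:
            self._ids[tag] = len(self._tags)
            self._tags.append(tag)
        return self._ids[tag]

    def tag(self, lang_id):
        """Resolve an id back to its tag."""
        return self._tags[lang_id]

table = LangTable()
meta_int = {"l": table.intern("en")}

# A user-defined tag cannot come from a fixed list, so it also has to be
# added to the table before it can be stored as an integer.
custom_id = table.intern("x-custom")
print(meta_int, table.tag(custom_id))
```

With strings, the meta map is self-describing and no table has to be persisted or consulted at query time, which matches the reasoning for starting with strings.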

bplatz commented 4 days ago

This issue is completed.