cayleygraph / cayley

An open-source graph database
https://cayley.io
Apache License 2.0
14.85k stars 1.25k forks

Language tagged strings and i18n #647

Open steffansluis opened 6 years ago

steffansluis commented 6 years ago

I noticed that language tagged strings are already supported internally in Cayley. It would be nice to be able to use this functionality through the HTTP APIs somehow (mostly Gizmo and GraphQL). I'm not entirely sure yet what would be the best way to include this. I would think it would work somewhat similarly to the labeled sub-graphs, so I could imagine a Lang directive for Gizmo and a @lang and @language directive for GraphQL (similar to what I proposed in #614). Maybe it makes more sense though to have it work like the @opt directive, where it doesn't propagate implicitly to the nested query. In any case, curious to hear what you think about it!

dennwc commented 6 years ago

Good point, language strings have been implemented for quite some time, but we don't have any queries for them yet. Also, KV/memstore has no indexes for lang, while SQL/Mongo can be easily extended to provide support for these queries by enabling an index on this field.

This feature is not implemented mostly because we wanted to postpone this work until after reification is done, since lang will look like a custom metadata field on string nodes, and all custom metadata queries will work for language as well. But it makes sense to add @lang in the current implementation, since it's not that hard, and the internal implementation details will change under the hood of the query without affecting GraphQL.

Also, more details should be discussed before starting this. For example, what is the default behavior for a name @lang("en") query in case the "en" language string does not exist? Should it assume that there is no such predicate, or fall back to the string without language tags? What if only a German string is available, but not a generic version?

steffansluis commented 6 years ago

SPARQL treats language tagged strings and non-language tagged strings as different values because they are not the same RDF literals. This makes sense, since the RDF spec also states that the datatype differs for language-tagged strings. As such, I would think that when not using the @lang directive the name has to be the regular type of string and with the @lang directive the name would have to be the language-tagged type of string. Therefore, there is no generic version of a language-tagged string since they are distinct values within their type, so no fallback, just a null result if a particular language doesn't have a value defined. The specification of the language is standardized as well by the RDF spec.
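The "no fallback" behavior described above can be illustrated with a minimal sketch. This is not Cayley's implementation: the `Literal` type, the `NAMES` data, and the `pick` helper are all hypothetical names made up for the example. It only assumes that a plain string and a language-tagged string are distinct values, and that language tags compare case-insensitively.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF string literal."""
    value: str
    lang: Optional[str] = None  # None means a plain (untagged) string


# Hypothetical values of a "name" predicate on a single node.
NAMES = [Literal("Bob"), Literal("Robert", "de")]


def _norm(tag: Optional[str]) -> Optional[str]:
    # Language tags are case-insensitive, so compare them lowercased.
    return tag.lower() if tag else None


def pick(values, lang: Optional[str] = None) -> Optional[str]:
    """Return the first value whose tag equals `lang` exactly, else None.

    With lang=None only plain strings qualify; a tagged request never
    falls back to the plain string.
    """
    wanted = _norm(lang)
    for lit in values:
        if _norm(lit.lang) == wanted:
            return lit.value
    return None


print(pick(NAMES))        # Bob
print(pick(NAMES, "de"))  # Robert
print(pick(NAMES, "en"))  # None: no "en" value and no fallback
```

Under this reading, the "only a German string is available" case simply yields a null result for any other language.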

iddan commented 5 years ago

Hey, I need to get language literals in my Gizmo queries but I can't. I tried to look into the code of Gizmo but I was not able to figure out where the values are formatted and why LangString is not formatted correctly. @dennwc can you help me out?

iddan commented 5 years ago

Found it! My fix is here: https://github.com/cayleygraph/cayley/pull/819

dennwc commented 5 years ago

@iddan If the problem is in Gizmo specifically, I propose to add some accessor on JS side instead. Or maybe wrap language string in a small object with those two fields (on JS side, again)?

Am I missing something here?

iddan commented 5 years ago

Currently, as a Gizmo user, I assume values to always be a regular string or an n3 valid value because this is the current behaviour with strings and URIs. I think it would have been better if Gizmo returned a JSON-LD valid string. People have tried to spec RDF-JSON but they decided JSON-LD is a better approach. Anyway, it should be consistent.

dennwc commented 5 years ago

@iddan I agree, but for regular strings and IRIs the conversion is trivial - it's just a single string after all. For LangString it's different - we have two fields now.

And, at least from my point of view, the "Bob"@en is not exactly easy to use from JS. Right now Cayley converts it to Bob (no quotes, unescaped) when "flattening" results.

But I agree, JSON encoding/decoding of RDF values is not that good or consistent in Cayley. We should use whatever spec is available for it. JSON-LD as you mentioned may be a good target spec.
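To make the "flattening" concern concrete, here is a rough sketch of what dropping the quotes and the tag does to a literal. It is an illustration only, not Cayley's actual code path: the `flatten` function and the regular expression are assumptions, and only the `\"` and `\\` escapes are handled.

```python
import re

# A quoted literal with an optional language tag, e.g. "Bob"@en.
_LITERAL = re.compile(r'^"((?:[^"\\]|\\.)*)"(?:@([A-Za-z]+(?:-[A-Za-z0-9]+)*))?$')


def flatten(term: str) -> str:
    """Strip quotes, escapes, and any language tag from a literal.

    Terms that are not quoted literals (for example IRIs) are returned
    unchanged.
    """
    m = _LITERAL.match(term)
    if m is None:
        return term
    # Unescape \" and \\ only; other escapes are left untouched.
    return re.sub(r'\\(["\\])', r'\1', m.group(1))


print(flatten('"Bob"@en'))  # Bob
print(flatten('"Bob"@de'))  # Bob
```

Both inputs flatten to the same string, so the language can no longer be recovered from the result.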

iddan commented 5 years ago

The JSON-LD value would be:

{ "@language": "en", "@value": "Bob" }

I need a way to get the language for string results, so what do you suggest to do?

dennwc commented 5 years ago

I think we can start by adding support for those values to gizmo.toQuadValue. This way it would be possible to create those values without using lang() JS helper. The second step is to intercept all calls to vm.ToValue() like this one and convert quad.LangString to the JS object you mentioned.
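The two steps could look roughly like the following. This is a language-neutral sketch in Python rather than the Go/JS code in Cayley: `LangString`, `to_js_value`, and `to_quad_value` are hypothetical names that only mirror the roles of `quad.LangString`, `vm.ToValue()`, and `gizmo.toQuadValue` mentioned above. The `@value` and `@language` keys come from the JSON-LD value object shown earlier in the thread.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LangString:
    """Hypothetical two-field value: the text plus its language tag."""
    value: str
    lang: str


def to_js_value(v: Any) -> Any:
    """Query result -> script side: wrap language strings in an object."""
    if isinstance(v, LangString):
        return {"@value": v.value, "@language": v.lang}
    return v  # plain strings and other values pass through unchanged


def to_quad_value(obj: Any) -> Any:
    """Script side -> quad value: accept the same object shape as input."""
    if isinstance(obj, dict) and "@value" in obj and "@language" in obj:
        return LangString(obj["@value"], obj["@language"])
    return obj


bob = LangString("Bob", "en")
print(to_js_value(bob))  # {'@value': 'Bob', '@language': 'en'}
```

Keeping both directions symmetric means a value read from a query can be passed back into another query without losing its language.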

Later we should consider changing our JSON I/O formats to accept/emit those values as well.

iddan commented 5 years ago

I opened a different issue for changing to JSON-LD for further discussion: https://github.com/cayleygraph/cayley/issues/820

iddan commented 5 years ago

@dennwc as far as I understand we currently load language strings correctly from files and when inserting quads through the UI.

dennwc commented 5 years ago

@iddan Correct. The only issue is query-side support. Right now it's possible to query for an exact match, but not with a wildcard in the language, for example.
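The difference between an exact match and a language wildcard can be sketched as follows. This is not a Cayley API: `lang_matches`, `filter_by_lang`, and the `LITERALS` data are hypothetical, and the matching rule is borrowed from SPARQL's `langMatches` (basic filtering, where `*` matches any tagged string), which is an assumption about what a wildcard query would mean here.

```python
from typing import List, Optional, Tuple

# Hypothetical (value, language tag) pairs; None marks a plain string.
LITERALS: List[Tuple[str, Optional[str]]] = [
    ("Bob", "en"),
    ("Bob", "en-US"),
    ("Robert", "de"),
    ("Bob", None),
]


def lang_matches(tag: Optional[str], lang_range: str) -> bool:
    """Basic language-range matching, case-insensitive.

    "*" matches any tagged string; "en" matches "en" and "en-US".
    Plain (untagged) strings never match.
    """
    if not tag:
        return False
    if lang_range == "*":
        return True
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def filter_by_lang(literals, lang_range: str):
    return [(v, t) for v, t in literals if lang_matches(t, lang_range)]


# Exact match: value and tag must both be known up front.
print(("Bob", "en") in LITERALS)      # True
# Wildcard: any language-tagged value qualifies.
print(filter_by_lang(LITERALS, "*"))
```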

iddan commented 5 years ago

Got it.

iddan commented 5 years ago

https://github.com/cayleygraph/cayley/pull/834 solved querying for lang strings with regex filter