Closed hmottestad closed 3 years ago
@jeenbroekstra what do you think?
Can you elaborate with a few examples on what such complex wildcards allow you to do that is beyond what the current standard langMatches offers?
If there's a good use case for it, I'm not against adding it. The logical place would be to add an extension to the ExtendedEvaluationStrategy (which is where we also support other not-strictly-SPARQL-1.1 features).
"en-*" matches "en-GB" but not "en".
"de-*-DE" would match "de-latn-DE" and "de-Latn-DE-1996", but not "de-DE-1996". (Under "extended filtering" it would actually match "de-DE-1996" as well, since it understands the groups of the language tags... I still can't quite get my head around it.)
Neither of these is possible today.
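To make the intended behaviour concrete, here is a minimal sketch of a position-by-position wildcard match that reproduces the examples above. It is only my reading of those examples, not Jena's or RDF4J's actual code; the function name `wildcard_lang_matches` is hypothetical.

```python
def wildcard_lang_matches(tag: str, lang_range: str) -> bool:
    """Positional match: '*' in the range stands for exactly one subtag.

    Assumption: semantics are inferred from the examples in this thread.
    """
    if not tag:
        return False  # a literal without a language tag never matches
    tag_parts = tag.lower().split("-")
    range_parts = lang_range.lower().split("-")
    # A range with more parts than the tag can never match.
    if len(range_parts) > len(tag_parts):
        return False
    # Each range part must equal the tag part at the same position or be '*'.
    # Extra trailing subtags in the tag are allowed (prefix match).
    return all(r == "*" or r == t for r, t in zip(range_parts, tag_parts))


print(wildcard_lang_matches("en-GB", "en-*"))               # True
print(wildcard_lang_matches("en", "en-*"))                  # False
print(wildcard_lang_matches("de-latn-DE", "de-*-DE"))       # True
print(wildcard_lang_matches("de-DE-1996", "de-*-DE"))       # False
print(wildcard_lang_matches("de-Latn-DE-1996", "de-*-DE"))  # True
```

Because a range without wildcards still behaves as a case-insensitive prefix match, plain "en" keeps matching both "en" and "en-GB" under this sketch.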
Another option is a sail-level config for enabling extended filtering. Essentially, it would let users set one of these modes: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.FilteringMode.html
Thanks for the examples and the link, I think it's a lot clearer now what we're after, and I can see the use of it.
As for where to put this, I think defining it as a separate sail config option may be problematic. If you do it that way, you almost automatically force custom handling of the option by every sail implementation (while if we build it into the evaluation strategy, you get it working for free on pretty much every sail). Note, by the way, that the evaluation strategy is itself a sail config option, currently.
I have been thinking though that I don't particularly like the current inheritance structure we use for evaluation strategies. I'd like it if we could make it a bit more modular so that individual extensions can be configured to be enabled or disabled, rather than the current "all extensions or nothing" approach.
@rdstn could you share some examples and expected behaviours with us?
Sure. The prime example I've been working with is related to SHACL.
We have a string, for example a name, which must match at least one sign language and at least one language spoken in France. We also do not have processing for Cyrillic, so we must not match any Cyrl tags. And we dislike the GB variety of English, since American English is strongly encouraged in our enterprise.
This would have to be something like this:
sh:property [
    sh:path voc:name ;
    sh:qualifiedValueShape [ sh:languageIn ( "sgn-*" ) ] ;
    sh:qualifiedMinCount 1 ;
] ;
sh:property [
    sh:path voc:name ;
    sh:qualifiedValueShape [ sh:or ( [ sh:languageIn ( "fr-*" ) ] [ sh:languageIn ( "*-FR-*" ) ] ) ] ;
    sh:qualifiedMinCount 1 ;
] ;
sh:property [
    sh:path voc:name ;
    sh:qualifiedValueShape [ sh:languageIn ( "*-Cyrl-*" ) ] ;
    sh:qualifiedMaxCount 0 ;
] ;
sh:property [
    sh:path voc:name ;
    sh:qualifiedValueShape [ sh:languageIn ( "*-GB-*" ) ] ;
    sh:qualifiedMaxCount 0 ;
] ;
Those use qualified max count, which is not part of SHACL in RDF4J (yet), but I imagine that the corresponding SPARQL can be easily derived.
This would suggest that the following data is invalid (breaks all constraints):
rdf:exampleId voc:name "Name"@en-GB
rdf:exampleId voc:name "Нейм"@en-Cyrl
While the following data is valid:
rdf:exampleId voc:name "Nom"@fr
rdf:exampleId voc:name "I do not exactly know how to transcribe this"@sgn
The final shape example is highly contrived, but I can imagine instances where the first three shapes are of interest.
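The ranges in these shapes only give the stated results if a non-initial `*` may also match zero subtags, which is how extended filtering in RFC 4647 treats it. Below is a rough sketch of that matching, following my reading of the RFC's extended filtering steps; the function name `extended_filter_matches` is hypothetical and this is not something RDF4J or `sh:languageIn` does today.

```python
def extended_filter_matches(tag: str, lang_range: str) -> bool:
    """Sketch of RFC 4647 extended filtering (assumed reading of the spec)."""
    if not tag:
        return False
    tag_parts = tag.lower().split("-")
    range_parts = lang_range.lower().split("-")
    # The first subtags must match; a leading '*' accepts any language.
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False
    t = 1  # index of the next tag subtag to examine
    for r in range_parts[1:]:
        if r == "*":
            continue  # later wildcards are skipped, so they may match nothing
        while True:
            if t >= len(tag_parts):
                return False  # range subtag left over but tag is exhausted
            if tag_parts[t] == r:
                t += 1
                break
            if len(tag_parts[t]) == 1:
                return False  # a singleton such as 'x' stops the search
            t += 1  # skip a non-matching tag subtag and retry
    return True


ranges = ["sgn-*", "fr-*", "*-FR-*", "*-Cyrl-*", "*-GB-*"]
for tag in ["en-GB", "en-Cyrl", "fr", "sgn"]:
    print(tag, [r for r in ranges if extended_filter_matches(tag, r)])
# en-GB ['*-GB-*']
# en-Cyrl ['*-Cyrl-*']
# fr ['fr-*']
# sgn ['sgn-*']
```

Under this reading "de-*-DE" also matches "de-DE-1996", and "en-*" also matches plain "en", which is the main difference from a strictly positional wildcard.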
I would prefer that "en-*" match "en-GB" but not "en", while "en" would match both "en" and "en-GB".
The algorithm in Jena works as follows.
1.1 If the range has more parts than the lang tag, return false
Is there a way to match en and only en? We have three possible modes and, up till now, we have talked about only two ways to express them: en-* and en. Could we expand it a bit?
- en-* matches en-GB, not en
- en* matches en and en-GB
- en matches en, not en-GB
If it's not feasible, I agree that having en-* match only dialects, and en match both the singleton tag and all dialects, is also fine.
With SHACL I guess it would be possible to do something like "en and not en-*". Other than that I don't think there is support for it in language ranges.
@jeenbroekstra could we do this without adding any configuration? The algorithm that Jena uses is backwards compatible, so if we use the same algorithm we could introduce it in the next feature release.
I'm not sure what you mean by "without adding any configuration"? If you mean just adding the feature to the ExtendedEvaluationStrategy, I'm fine with that.
SPARQL langMatches should support basic language ranges, which excludes the use of more complex wildcards. For instance "en-*" is not allowed, but "en" works almost the same way.
Jena has chosen to support these more complex wildcard language ranges. However, it does not go as far as to support extended language ranges. Extended language ranges go much further than just supporting wildcards.
This would move us away from the SPARQL 1.1 recommendation, but would align us more closely with Jena.
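For comparison, the basic filtering that standard langMatches performs can be sketched as follows. This is an illustration of the rule as I understand it from SPARQL 1.1 and RFC 4647 (case-insensitive equality or prefix followed by "-", with "*" matching any non-empty tag); `basic_lang_matches` is a hypothetical helper name, not RDF4J's implementation.

```python
def basic_lang_matches(tag: str, lang_range: str) -> bool:
    """Sketch of basic language-range filtering as used by langMatches."""
    if not tag:
        return False  # literals without a language tag never match
    if lang_range == "*":
        return True   # the only wildcard form a basic range allows
    tag, lang_range = tag.lower(), lang_range.lower()
    # Match on exact equality, or on a prefix that ends at a subtag boundary.
    return tag == lang_range or tag.startswith(lang_range + "-")


print(basic_lang_matches("en-GB", "en"))    # True
print(basic_lang_matches("en", "en"))       # True
print(basic_lang_matches("en-GB", "en-*"))  # False
```

The last call shows why "en-*" is of no use today: the "*" is compared literally, so there is no way to say "dialects only".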