eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Support complex wildcard language ranges #2500

Closed hmottestad closed 3 years ago

hmottestad commented 4 years ago

SPARQL langMatches is specified to support only basic language ranges, which rules out more complex wildcards. For instance, "en-*" is not allowed, but "en" works almost the same way.

Jena has chosen to support these more complex wildcard language ranges. However, it does not go as far as supporting extended language ranges, which go much further than just supporting wildcards.

This would move us away from the SPARQL 1.1 recommendation, but would align us more closely with Jena.
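For reference, basic language-range matching as described above can be sketched in a few lines of Java. This is a minimal illustration of basic filtering, not RDF4J's actual implementation; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper illustrating basic language-range matching only.
final class BasicLangRangeSketch {

    // Returns true if langTag matches langRange under basic filtering:
    // "*" matches any non-empty tag; otherwise the range must equal the tag
    // or be a prefix of it that is followed by "-". Comparison ignores case.
    static boolean langMatches(String langTag, String langRange) {
        if (langTag.isEmpty()) {
            return false;
        }
        if (langRange.equals("*")) {
            return true;
        }
        String tag = langTag.toLowerCase(Locale.ROOT);
        String range = langRange.toLowerCase(Locale.ROOT);
        return tag.equals(range) || tag.startsWith(range + "-");
    }
}
```

Under this sketch "en" matches both "en" and "en-GB", while "en-*" is treated as a literal prefix and matches neither.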

hmottestad commented 4 years ago

@jeenbroekstra what do you think?

abrokenjester commented 4 years ago

Can you elaborate with a few examples on what such complex wildcards allow you to do that is beyond what the current standard langMatches offers?

If there's a good use case for it, I'm not against adding it. The logical place would be to add an extension to the ExtendedEvaluationStrategy (which is where we also support other not-strictly-sparql-1.1 features).

hmottestad commented 4 years ago

"en-*" matches "en-GB" but not "en".

"de-*-DE" would match "de-latn-DE" and "de-Latn-DE-1996", but not "de-DE-1996". (With "extended filtering" it would actually match "de-DE-1996" too, since it understands the groups within language tags... I still can't quite get my head around it.)

Neither of these is possible today.

hmottestad commented 4 years ago

Another option is a sail level config for enabling extended filtering. Essentially allows users to set one of these: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Locale.FilteringMode.html
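To get a feel for what these modes do, the JDK's own `Locale.filterTags` can be called directly. The sketch below is only an experiment against the standard library, not a proposal for how RDF4J would wire this in; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that asks the JDK whether a single tag passes a single range.
final class FilteringModeSketch {

    static boolean matches(String langTag, String langRange, Locale.FilteringMode mode) {
        List<Locale.LanguageRange> ranges =
                Collections.singletonList(new Locale.LanguageRange(langRange));
        // filterTags returns the subset of tags that match; non-empty means a match.
        return !Locale.filterTags(ranges, Collections.singletonList(langTag), mode).isEmpty();
    }

    public static void main(String[] args) {
        Locale.FilteringMode mode = Locale.FilteringMode.EXTENDED_FILTERING;
        for (String tag : new String[] { "de", "de-Latn-DE", "de-DE-1996", "de-Latn-DE-1996" }) {
            System.out.println(tag + " -> " + matches(tag, "de-*-DE", mode));
        }
    }
}
```

With `EXTENDED_FILTERING`, "de-*-DE" accepts "de-DE-1996" as well as "de-Latn-DE", which is the behaviour noted in the earlier comment.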

abrokenjester commented 4 years ago

Thanks for the examples and the link, I think it's a lot clearer now what we're after, and I can see the use of it.

As for where to put this, I think defining it as a separate sail config option may be problematic. If you do it that way, you almost automatically force custom handling of the option by every sail implementation (while if we build it into the evaluation strategy, you get it working for free on pretty much every sail). Note, by the way, that the evaluation strategy is itself a sail config option, currently.

I have been thinking though that I don't particularly like the current inheritance structure we use for evaluation strategies. I'd like it if we could make it a bit more modular so that individual extensions can be configured to be enabled or disabled, rather than the current "all extensions or nothing" approach.

hmottestad commented 4 years ago

@rdstn could you share some examples and expected behaviours with us?

rdstn commented 4 years ago

Sure. The prime example I've been working with is related to SHACL.

We have a string, for example a name, which must match at least one sign language and at least one language spoken in France. We also do not have processing for Cyrillic, so we must not match any Cyrl tags. And we dislike the GB variety of English, since American English is strongly encouraged in our enterprise.

This would have to be something like this:

sh:property [
        sh:path voc:name ;
        sh:qualifiedValueShape [ sh:languageIn ( "sgn-*" ) ] ;
        sh:qualifiedMinCount 1 ;
    ] ;
sh:property [
        sh:path voc:name ;
        sh:qualifiedValueShape [ sh:or ( [ sh:languageIn ( "fr-*" ) ] [ sh:languageIn ( "*-FR-*" ) ] ) ] ;
        sh:qualifiedMinCount 1 ;
    ] ;
sh:property [
        sh:path voc:name ;
        sh:qualifiedValueShape [ sh:languageIn ( "*-Cyrl-*" ) ] ;
        sh:qualifiedMaxCount 0 ;
    ] ;
sh:property [
        sh:path voc:name ;
        sh:qualifiedValueShape [ sh:languageIn ( "*-GB-*" ) ] ;
        sh:qualifiedMaxCount 0 ;
    ] ;

Those use qualified max count, which is not part of SHACL in RDF4J (yet), but I imagine that the corresponding SPARQL can be easily derived.

This would suggest that the following data is invalid (breaks all constraints):

rdf:exampleId voc:name "Name"@en-GB
rdf:exampleId voc:name "Нейм"@en-Cyrl

While the following data is valid:


rdf:exampleId voc:name "Nom"@fr
rdf:exampleId voc:name "I do not exactly know how to transcribe this"@sgn

The final shape example is highly contrived, but I can imagine instances where the first three shapes are of interest.

hmottestad commented 4 years ago

I would prefer that "en-*" match "en-GB" but not "en", while "en" would match both "en" and "en-GB".

The algorithm in Jena works as follows.

  1. Split the lang tag and the lang range on "-".
     1.1. If the range has more parts than the lang tag, return false.
  2. Iterate through the range array, looking up the same index in the lang tag array.
     2.1. If they are the same, continue.
     2.2. If the range part is "*", continue.
     2.3. Else return false.
  3. Return true.

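The steps above can be sketched in Java roughly as follows. This is a transcription of the listed steps, not Jena's actual code; the class and method names are hypothetical, and the lower-casing is an added assumption based on language tags being case-insensitive.

```java
import java.util.Locale;

// Hypothetical sketch of the positional wildcard matching described in the steps above.
final class WildcardLangRangeSketch {

    static boolean matches(String langTag, String langRange) {
        // Step 1: split both on "-" (lower-cased first; an assumption, see lead-in).
        String[] tagParts = langTag.toLowerCase(Locale.ROOT).split("-");
        String[] rangeParts = langRange.toLowerCase(Locale.ROOT).split("-");

        // Step 1.1: a range with more parts than the tag can never match.
        if (rangeParts.length > tagParts.length) {
            return false;
        }

        // Step 2: compare position by position; "*" accepts any single subtag.
        for (int i = 0; i < rangeParts.length; i++) {
            if (rangeParts[i].equals("*")) {
                continue;
            }
            if (!rangeParts[i].equals(tagParts[i])) {
                return false;
            }
        }

        // Step 3: every range part was satisfied.
        return true;
    }
}
```

Because a range without wildcards behaves as a plain prefix match on subtags, existing ranges such as "en" keep matching "en" and "en-GB", while "en-*" matches "en-GB" but not "en", and "de-*-DE" matches "de-Latn-DE-1996" but not "de-DE-1996".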
rdstn commented 4 years ago

Is there a way to match en and only en? We have three possible modes and, up till now, we have talked about only two ways to express them - en-* and en. We could expand it a bit?

  1. Match en-GB but not en: "en-*"
  2. Match both en and en-GB: "en*"
  3. Match en but not en-GB: "en"

If it's not feasible, I agree that matching en-* to only include dialects and en to match both a singleton tag and all dialects is also fine.

hmottestad commented 4 years ago

With SHACL I guess it would be possible to do something like "en and not en-*". Other than that I don't think there is support for it in language ranges.

hmottestad commented 4 years ago

@jeenbroekstra could we do this without adding any configuration? The algorithm that Jena uses is backwards compatible, so if we use the same algorithm we could introduce it in the next feature release.

abrokenjester commented 4 years ago

I'm not sure what you mean by "without adding any configuration". If you mean just adding the feature to the ExtendedEvaluationStrategy, I'm fine with that.