Open rdstn opened 4 years ago
Forgot to add that this is on 3.3.1
The same problem applies for x-i-enochian (when normalized) and i-enochian (when parsed for import).
@hmottestad I believe you recently did some work on language tag processing to DRY this up? Is this still a problem in more recent versions?
@rdstn the Literals.normalizeLanguageTag is the officially documented "correct" way to normalize language tags (it's also what all Rio parsers rely on, for example). That's not to say there can't be bugs in it, of course, but there should not be a separate, different normalization process somewhere else in the code base. As I mentioned above, I'm aware that Håvard recently made some significant progress on language tag processing in the SHACL engine, and part of that was making sure the existing normalization code is reused, so this might not be a problem in more recent versions - see #2452, which was fixed in release 3.4.1. Can you verify whether the problem still occurs in that release?
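For reference, a minimal sketch of calling it directly (the tag values are borrowed from the test case further down in this thread):

import org.eclipse.rdf4j.model.util.Literals;

// Case-normalizes per BCP47 and, notably, replaces the extended
// language form ("zh-cmn") with the primary language subtag ("cmn").
String normalized = Literals.normalizeLanguageTag("zh-cmn-HANS-cn"); // "cmn-Hans-CN"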
With regard to normalization: I didn't implement any for language tags in the SHACL engine, following some discussion with the SHACL user group.
Do I understand correctly that what you want is to be able to normalize all language tags before they are checked against the shapes?
If that's what you need, I can add a new "advancedLangTagSupport" feature that can be enabled (or maybe just be on by default), and as part of it we can also add support for wildcard patterns.
No need to normalize them on the server side, actually; we are using Literals.normalizeLanguageTag before passing them to SHACL, so we do that on the client side. The weird part is that zh-cmn-hans-CN gets normalized to cmn-hans-CN when normalizing the language tag, whereas when importing data it stays zh-cmn-hans-CN. Same for that enochian entry.
It's not like normalization is completely skipped either: if you try to insert a triple with zh-cmn-HANS-cn, it gets normalized to zh-cmn-hans-CN. It is just that first prefix, which normalizeLanguageTag consistently removes but the parser doesn't care about.
Here's some example code illustrating this:
@Test
public void testInsertData() {
    String EX_NS = "http://example.org/";
    String update = getNamespaceDeclarations() +
            "INSERT DATA { ex:book1 ex:langZh \"Some text\"@zh-cmn-HANS-cn ; ex:langNoZh \"Some text\"@cmn-HANS-cn . } ";
    Update operation = con.prepareUpdate(QueryLanguage.SPARQL, update);
    IRI book1 = f.createIRI(EX_NS, "book1");
    IRI withZh = f.createIRI(EX_NS, "langZh");
    IRI withoutZh = f.createIRI(EX_NS, "langNoZh");
    operation.execute();
    // The SPARQL/Rio import path only case-normalizes: the zh- prefix survives the insert.
    assertTrue(con.hasStatement(book1, withZh, f.createLiteral("Some text", "zh-cmn-hans-CN"), true));
    assertFalse(con.hasStatement(book1, withZh, f.createLiteral("Some text", "cmn-hans-CN"), true));
    assertTrue(con.hasStatement(book1, withoutZh, f.createLiteral("Some text", "cmn-hans-CN"), true));
    // normalizeLanguageTag, by contrast, also drops the extended-language zh- prefix.
    assertEquals("cmn-Hans-CN", Literals.normalizeLanguageTag("zh-cmn-HANS-cn"));
    assertEquals("cmn-Hans-CN", Literals.normalizeLanguageTag("cmn-HANS-cn"));
}
Thanks for the great compact test code!
It seems Literals isn't doing anything RDF4J-specific; it relies on the JDK:
public static String normalizeLanguageTag(String languageTag) throws IllformedLocaleException {
    return new Locale.Builder().setLanguageTag(languageTag).build().toLanguageTag().intern();
}
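A quick way to confirm this outside of RDF4J is to ask the JDK directly (a minimal sketch; java.util.Locale documents that it canonicalizes extended language subtags):

import java.util.Locale;

// The JDK canonicalizes the extlang form "zh-cmn-..." to the primary
// subtag "cmn-...", which is exactly what normalizeLanguageTag returns.
Locale locale = new Locale.Builder().setLanguageTag("zh-cmn-HANS-cn").build();
System.out.println(locale.toLanguageTag()); // prints "cmn-Hans-CN"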
Section 2.2.2 of https://tools.ietf.org/html/bcp47 mentions that
the primary language subtag ('gan', 'yue', 'cmn') is preferred to using the extended language form ("zh-gan", "zh-yue", "zh-cmn").
So this seems to be correct behavior, though it remains to be checked why the parser behaves differently. Same goes for x-i-enochian, I guess; looks like i-enochian is the correct normalization.
One can enable language tag normalization (disabled by default for performance reasons) when loading data via Rio, though I'm not sure whether this also works for SPARQL update queries:
con.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);
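For loading data through a standalone Rio parser, the equivalent would be something like this (a minimal sketch; the inline Turtle and the urn: IRIs are just illustrative):

import java.io.StringReader;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.impl.LinkedHashModel;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFParser;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.helpers.BasicParserSettings;
import org.eclipse.rdf4j.rio.helpers.StatementCollector;

RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
parser.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);

Model model = new LinkedHashModel();
parser.setRDFHandler(new StatementCollector(model));
// With normalization enabled, the tag below should come out of the parser
// normalized; whether that also drops the zh- prefix is the open question here.
parser.parse(new StringReader("<urn:a> <urn:b> \"text\"@zh-cmn-HANS-cn ."), "");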
I haven't fully followed this issue, but as I understand it, you're scratching your heads over a difference in normalization between a Rio parser and calling Literals.normalizeLanguageTag manually, correct?
The Rio toolkit contains two implementations of the LanguageHandler interface. One is based on IETF BCP47 (this one reuses Literals.normalizeLanguageTag), but there's another implementation, based on RFC3066 (which I think is an older spec). It looks as if Rio currently picks the latter as its default language tag handler. That could be where the discrepancy comes from.
If that is the case, I think we can classify this as a bug in Rio, because the RDF 1.1 abstract syntax clearly specifies that language tags are expected to be formatted according to BCP47 (see https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal).
Never mind the above comments, I think this is a red herring, and Rio does in fact use BCP47 language handling by default.
I am somewhat confused over what the exact problem is here. @rdstn can you clarify the following:
However, when trying to import the following data:
<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@zh-cmn-Hans-CN, "zh-yue-HK text"@zh-yue-HK .
We get validation errors, since the data is parsed together with the zh- prefix.
Specifically what I'd like to know is how you import this data, and what you mean with "is parsed together with the zh- prefix". Parsed by which component? Sorry if I'm being obtuse but I'm just a little lost in pinpointing the exact issue.
No problem. It's the Rio parser. Here's the call stack to the literal creation code:
createLiteral:114, AbstractValueFactory (org.eclipse.rdf4j.model.impl)
createLiteral:65, RDFStarDecodingValueFactory (org.eclipse.rdf4j.rio.helpers)
createLiteral:197, RDFParserHelper (org.eclipse.rdf4j.rio.helpers)
createLiteral:540, AbstractRDFParser (org.eclipse.rdf4j.rio.helpers)
parseQuotedLiteral:651, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseValue:589, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseValue:56, TriGStarParser (org.eclipse.rdf4j.rio.trigstar)
parseObject:462, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseObjectList:390, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parsePredicateObjectList:363, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseGraph:147, SPARQLUpdateDataBlockParser (org.eclipse.rdf4j.query.parser.sparql)
parseStatement:112, TriGParser (org.eclipse.rdf4j.rio.trig)
parse:179, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseUpdate:145, SPARQLParser (org.eclipse.rdf4j.query.parser.sparql)
parseUpdate:76, QueryParserUtil (org.eclipse.rdf4j.query.parser)
prepareUpdate:308, SailRepositoryConnection (org.eclipse.rdf4j.repository.sail)
getSparqlUpdateResult:231, StatementsController (org.eclipse.rdf4j.http.server.repository.statements)
I see that normalization appears to be disabled by default - this is why we get the un-normalized data.
I suppose that in order to ensure consistency, we need to normalize the language tags ourselves, and also set the org.eclipse.rdf4j.rio.normalize_language_tags flag to true. That way, one user can input SHACL with zh-CMN-hans-cn and another can insert data tagged zh-cmn-hans-CN without a hitch.
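That is, both sides collapse to the same canonical form (a minimal sketch of the convergence, reusing the values from this thread):

// Shape side: normalize whatever spelling the user typed.
String shapeTag = Literals.normalizeLanguageTag("zh-CMN-hans-cn"); // "cmn-Hans-CN"
// Data side: the same canonical form, once the import path normalizes too.
String dataTag = Literals.normalizeLanguageTag("zh-cmn-hans-CN");  // "cmn-Hans-CN"
assert shapeTag.equals(dataTag);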
Normalization is indeed disabled by default for performance reasons, and it can be enabled via the flag you mentioned, or by setting the parser config in code.
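A note on the flag form: I'm assuming here that the setting's key can be supplied as a JVM system property, so treat this sketch as unverified:

// Assumption, not verified against the docs: the RioSetting key doubles as a
// system property override for parsers that use the default configuration.
System.setProperty("org.eclipse.rdf4j.rio.normalize_language_tags", "true");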
I'm just going through the backlog and was wondering if there is still an issue here that needs to be addressed, or if we can close this ticket?
Hello. We are seeing a minor issue with SHACL.
We offer our users a way to input a meta-schema, which we then parse to SHACL. To normalize the language tags in it, we use org.eclipse.rdf4j.model.util.Literals#normalizeLanguageTag. This means that, for a user input of lang: {validate: "ZH-CMN-hANS-cn,ZH-YUE-hk"}, we get:

sh:languageIn ("cmn-Hans-CN" "yue-HK") ;
However, when trying to import the following data:

<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@zh-cmn-Hans-CN, "zh-yue-HK text"@zh-yue-HK .

we get validation errors, since the data is parsed together with the zh- prefix, whereas the following data is supposedly valid:

<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@cmn-Hans-CN, "zh-yue-HK text"@yue-HK .

This appears to be because two different methods are used for language tag normalization. Is this a bug, or should we just switch the method we use for language tag normalization? If so, any pointers towards what we should use instead?