eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Possible inconsistency - recommended method for parsing language tags? #2609

Open rdstn opened 4 years ago

rdstn commented 4 years ago

Hello. We are seeing a minor issue with SHACL.

We offer our users a way to input a meta-schema, which we then parse to SHACL. In order to normalize the language tags in it, we use org.eclipse.rdf4j.model.util.Literals#normalizeLanguageTag. This means that, for a user input:

lang: {validate: "ZH-CMN-hANS-cn,ZH-YUE-hk"}

We get:

sh:languageIn ("cmn-Hans-CN" "yue-HK") ;

However, when trying to import the following data:

<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@zh-cmn-Hans-CN, "zh-yue-HK text"@zh-yue-HK .

We get validation errors, since the data is parsed together with the zh- prefix; whereas the following data passes validation:

<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@cmn-Hans-CN, "zh-yue-HK text"@yue-HK .

This appears to be because two different methods are used for language tag normalization. Is this a bug, or should we just switch the method we use for language tag normalization? If so, any pointers towards what we should use instead?
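
For reference, here is a minimal sketch of the normalization step described above (the splitting of the comma-separated input is our own glue code; only Literals.normalizeLanguageTag is RDF4J API):

import org.eclipse.rdf4j.model.util.Literals;

// Normalize each user-supplied tag: "ZH-CMN-hANS-cn" becomes "cmn-Hans-CN"
// and "ZH-YUE-hk" becomes "yue-HK", which is what ends up in sh:languageIn.
for (String tag : "ZH-CMN-hANS-cn,ZH-YUE-hk".split(",")) {
    System.out.println(Literals.normalizeLanguageTag(tag));
}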

rdstn commented 4 years ago

Forgot to add that this is on RDF4J 3.3.1.

rdstn commented 4 years ago

The same problem applies for x-i-enochian (when normalized) and i-enochian (when parsed for import).

abrokenjester commented 4 years ago

@hmottestad I believe you recently did some work on language tag processing to DRY this up? Is this still a problem in more recent versions?

abrokenjester commented 4 years ago

@rdstn Literals.normalizeLanguageTag is the officially documented "correct" way to normalize language tags (it's also what all Rio parsers rely on, for example). That's not to say there can't be bugs in it of course, but there should not be a separate, different normalization process somewhere else in the code base. Like I mentioned above, I'm aware that Havard recently made some significant progress on language tag processing in the SHACL engine, and part of that was making sure the existing normalization code was reused, so this might not be a problem in more recent versions - see #2452, which was fixed in release 3.4.1. Can you verify if the problem still occurs in that release?

hmottestad commented 4 years ago

With regard to normalization: I deliberately didn't implement any language tag normalization in the SHACL engine, after some discussion with the SHACL user group.

Do I understand correctly that what you want is to be able to normalize all language tags before they are checked against the shapes?

If that's what you need, I can add a new "advancedLangTagSupport" feature that can be enabled (or maybe is just on by default), and there we can also add support for wildcard patterns.

rdstn commented 4 years ago

No need to normalize them on the server side, actually; we call Literals.normalizeLanguageTag before passing the tags to SHACL, so we do that on the client side. The weird part is that zh-cmn-hans-CN gets normalized to cmn-hans-CN by that method, whereas when importing data it stays zh-cmn-hans-CN. The same goes for that enochian entry.

It's not like normalization is completely skipped either: if you try to insert a triple with zh-cmn-HANS-cn, it gets normalized to zh-cmn-hans-CN. It is just the zh- prefix that normalizeLanguageTag consistently removes but the parser doesn't care about.

rdstn commented 4 years ago

Here's some example code illustrating this:

    @Test
    public void testInsertData() {
        // con is an open RepositoryConnection, f its ValueFactory;
        // getNamespaceDeclarations() declares the ex: prefix.
        String EX_NS = "http://example.org/";
        String update = getNamespaceDeclarations() +
                "INSERT DATA { ex:book1 ex:langZh \"Some text\"@zh-cmn-HANS-cn ; ex:langNoZh \"Some text\"@cmn-HANS-cn . } ";
        Update operation = con.prepareUpdate(QueryLanguage.SPARQL, update);

        IRI book1 = f.createIRI(EX_NS, "book1");
        IRI withZh = f.createIRI(EX_NS, "langZh");
        IRI withoutZh = f.createIRI(EX_NS, "langNoZh");
        operation.execute();

        // The SPARQL update keeps the zh- prefix (only the case changes) ...
        assertTrue(con.hasStatement(book1, withZh, f.createLiteral("Some text", "zh-cmn-hans-CN"), true));
        assertFalse(con.hasStatement(book1, withZh, f.createLiteral("Some text", "cmn-hans-CN"), true));
        assertTrue(con.hasStatement(book1, withoutZh, f.createLiteral("Some text", "cmn-hans-CN"), true));
        // ... while Literals.normalizeLanguageTag drops it entirely.
        assertEquals("cmn-Hans-CN", Literals.normalizeLanguageTag("zh-cmn-HANS-cn"));
        assertEquals("cmn-Hans-CN", Literals.normalizeLanguageTag("cmn-HANS-cn"));
    }

hmottestad commented 4 years ago

Thanks for the great compact test code!

barthanssens commented 3 years ago

Seems like Literals isn't doing anything RDF4J-specific; it relies on the JDK:

public static String normalizeLanguageTag(String languageTag) throws IllformedLocaleException {
    return new Locale.Builder().setLanguageTag(languageTag).build().toLanguageTag().intern();
}
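
The JDK does all the work here, including the extlang replacement. A quick way to see this in isolation (a sketch using only java.util.Locale):

import java.util.Locale;

// Locale.Builder canonicalizes the tag: the extended language form
// "zh-cmn-Hans-CN" comes back as "cmn-Hans-CN".
String tag = new Locale.Builder()
        .setLanguageTag("zh-cmn-Hans-CN")
        .build()
        .toLanguageTag();
System.out.println(tag); // prints "cmn-Hans-CN"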

barthanssens commented 3 years ago

https://tools.ietf.org/html/bcp47, section 2.2.2, mentions that

the primary language subtag ('gan', 'yue', 'cmn') is preferred to using the extended language form ("zh-gan", "zh-yue", "zh-cmn").

So this seems to be correct behavior, though it remains to be checked why the parser behaves differently. The same goes for x-i-enochian, I guess; it looks like i-enochian is the correct normalization.

One can enable language tag normalization (disabled by default for performance reasons) when loading data via Rio, but I'm not sure whether this works for SPARQL update queries:

con.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);
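
The same setting can also be applied on a RepositoryConnection's parser config before loading a file (a sketch; the file name and base IRI are illustrative, and the enclosing method is assumed to declare throws IOException):

import java.io.FileInputStream;
import java.io.InputStream;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.helpers.BasicParserSettings;

// con is an open RepositoryConnection; language tags in parsed
// literals will be normalized on the way in.
con.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);
try (InputStream in = new FileInputStream("data.ttl")) {
    con.add(in, "http://example.org/", RDFFormat.TURTLE);
}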

abrokenjester commented 3 years ago

I haven't fully followed this issue, but as I understand it, you're scratching your heads over a difference in normalization between a Rio parser and just using Literals.normalizeLanguageTag manually, correct?

The Rio toolkit contains two implementations of the LanguageHandler interface. One is based on IETF BCP47 (this one reuses Literals.normalizeLanguageTag), but there's another implementation, based on RFC3066 (which I think is an older spec). It looks as if Rio currently picks the latter as its default language tag handler. That could be where the discrepancy comes from.
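
If the handler choice is the culprit, one way to check is to run both handlers on the same tag and compare the results (a sketch, assuming the two handler implementations in org.eclipse.rdf4j.rio.languages):

import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.rio.languages.BCP47LanguageHandler;
import org.eclipse.rdf4j.rio.languages.RFC3066LanguageHandler;

ValueFactory vf = SimpleValueFactory.getInstance();
// Print the normalized tag each handler produces for the same input.
System.out.println(new BCP47LanguageHandler()
        .normalizeLanguage("text", "zh-cmn-HANS-cn", vf).getLanguage().orElse(""));
System.out.println(new RFC3066LanguageHandler()
        .normalizeLanguage("text", "zh-cmn-HANS-cn", vf).getLanguage().orElse(""));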

abrokenjester commented 3 years ago

If that is the case, I think we can classify this as a bug in Rio, because the RDF 1.1 abstract syntax clearly specifies that language tags are expected to be formatted according to BCP47 (see https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal).

abrokenjester commented 3 years ago

Never mind the above comments, I think this is a red herring, and Rio does in fact use BCP47 language handling by default.

I am somewhat confused over what the exact problem is here. @rdstn can you clarify the following:

However, when trying to import the following data:

<http://example-langstr.com/6> a ont:LangStringUniq; ont:uniqRandomCapitalization3 "zh-cmn-Hans-CN text"@zh-cmn-Hans-CN, "zh-yue-HK text"@zh-yue-HK .

We get validation errors, since the data is parsed together with the zh- prefix;

Specifically, what I'd like to know is how you import this data, and what you mean by "is parsed together with the zh- prefix". Parsed by which component? Sorry if I'm being obtuse, but I'm just a little lost in pinpointing the exact issue.

rdstn commented 3 years ago

No problem. It's the Rio parser. Here's the call stack down to the literal creation code:

createLiteral:114, AbstractValueFactory (org.eclipse.rdf4j.model.impl)
createLiteral:65, RDFStarDecodingValueFactory (org.eclipse.rdf4j.rio.helpers)
createLiteral:197, RDFParserHelper (org.eclipse.rdf4j.rio.helpers)
createLiteral:540, AbstractRDFParser (org.eclipse.rdf4j.rio.helpers)
parseQuotedLiteral:651, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseValue:589, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseValue:56, TriGStarParser (org.eclipse.rdf4j.rio.trigstar)
parseObject:462, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseObjectList:390, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parsePredicateObjectList:363, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseGraph:147, SPARQLUpdateDataBlockParser (org.eclipse.rdf4j.query.parser.sparql)
parseStatement:112, TriGParser (org.eclipse.rdf4j.rio.trig)
parse:179, TurtleParser (org.eclipse.rdf4j.rio.turtle)
parseUpdate:145, SPARQLParser (org.eclipse.rdf4j.query.parser.sparql)
parseUpdate:76, QueryParserUtil (org.eclipse.rdf4j.query.parser)
prepareUpdate:308, SailRepositoryConnection (org.eclipse.rdf4j.repository.sail)
getSparqlUpdateResult:231, StatementsController (org.eclipse.rdf4j.http.server.repository.statements)

I see that normalization appears to be disabled by default, which is why we get the un-normalized data.

I suppose that in order to ensure consistency, we need to normalize the language tags ourselves, and also set the org.eclipse.rdf4j.rio.normalize_language_tags flag to true. That way, one user can input SHACL with zh-CMN-hans-cn and another can insert data tagged zh-cmn-hans-CN without a hitch.
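
A minimal sketch of that combination (con is an open RepositoryConnection; the tags are the ones from this thread):

import org.eclipse.rdf4j.model.util.Literals;
import org.eclipse.rdf4j.rio.helpers.BasicParserSettings;

// Client side: normalize the tags that go into sh:languageIn.
String shapeTag = Literals.normalizeLanguageTag("zh-CMN-hans-cn"); // "cmn-Hans-CN"

// Parser side: normalize tags on incoming data the same way,
// so inserted literals match the normalized shapes.
con.getParserConfig().set(BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, true);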

barthanssens commented 3 years ago

Normalization is indeed disabled by default for performance reasons, and can be enabled via the flag you mentioned, or by setting the ParserConfig in code.

abrokenjester commented 3 years ago

I'm just going through the backlog and was wondering if there is still an issue here that needs to be addressed, or if we can close this ticket?