eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License
360 stars 162 forks source link

Normalize language tags in ValueFactory / SimpleLiteral constructor #667

Closed abrokenjester closed 4 years ago

abrokenjester commented 7 years ago

Followup issue of #665 .

Currently, constructing a new Literal object using a language tag just takes the input string for the language tag as-is. However, language tags are considered case-insensitive. Quoting @nikolavp in discussion on issue #665:

Plain literals have a lexical form and optionally a language tag as defined by [RFC-3066], normalized to lowercase.

Note that here rdf4j and sesame always had the correct behaviour in terms of parsing(lowercasing) https://github.com/eclipse/rdf4j/blob/master/core/rio/languages/src/main/java/org/eclipse/rdf4j/rio/languages/RFC3066LanguageHandler.java#L76 but there might be problems internally with literals created through the factory(the case with which I opened the issue) and probably through SPARQL queries

  • In RDF1.1 the spec optionally allows lowercasing and explicitly says that literals should be compared in a lowercase manner. Section 3.3

Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case.

so I think it is valid to always lowercase even if we are using BCP47 despite the issue from Vlado(SES-1999)

So I see two options:

  • always lowercase in simpleliteral's consturctor - if we want to comply with both RDF1.0 and RDF1.1(it allows it)
    • or use the BCP language handler's logic in the constructor to comply with the RDF1.1 spec only

A problem to be solved is how to distinguish literals created by a parser (which already has its own language tag normalization) from literals created directly by user code (or by SPARQL updates): we don't want to normalize twice (performance penalty), and we also don't want to "override" whatever normalization rules the parser applies - so it's not simply a matter of doing toLowerCase() in the SimpleLiteral constructor.

ansell commented 7 years ago

Although the "value space" is always lowercase in both RDF-1.0 and RDF-1.1, there are users who want to preserve their preferred casing. In order to allow users to setup their preferred behaviour, I think the LanguageHandler, set using ParserConfig is the most appropriate location.

To be clear, RDF-1.0 does not require the actual language tags to be lowercased. It merely specifies that the value space, which is only used for comparisons, not concrete formats, is lowercase. We can store the language tags in whatever format necessary to roundtrip concrete formats in the way that users want, as long as we do comparisons using lowercase versions of the language tags. As we have had (valid) complaints in the past about the previous behaviour of always lowercasing, I don't think it is appropriate to go back to that.

To also be clear, we are not violating RFC3066 for RDF-1.0 documents by switching to using BCP47. BCP47 has grandfathered rules for cases that RFC3066 allowed, but are not otherwise valid in BCP47. Ie, it is a full replacement and there are no cases where we could be compliant with only RDF-1.1 as implied in the comment above.

abrokenjester commented 7 years ago

So, to facilitate all these scenarios, I'm thinking we should look at a mechanism to "inject" a LanguageHandler into a ValueFactory. That way it's user-configurable, and if we adapt the parsers to use this injection mechanism, it also prevents duplicate effort and/or the ValueFactory overriding the parser's normalization handling.

ansell commented 7 years ago

The parsers already have an injection mechanism if they support BasicParserSettings.LANGUAGE_HANDLERS, which all of our local parsers and any third-party parsers sitting on AbstractRDFParser do.

I have not so far added lowercase normalization to the BCP47LanguageHandler in pull request #666. It currently round-trips through Locale.Builder which does some tag order normalisation but does not do case normalisation as far as I know, although I haven't added tests for that so far. BCP47 is very clear that it is completely case-insensitive, so it doesn't comment on normalisation of case other than to say that users sometimes prefer particular casings. For example, "en-AU" is preferred by some users rather than the less (if ever, in my recollection) used "en-au".

It may be useful to add another ParserSetting and/or LanguageHandler parameter to explicitly choose full lowercase normalisation.

The current LanguageHandler implementation takes a ValueFactory as a parameter, so it wouldn't be too difficult to inject it into a particular ValueFactory and call it in those cases with "this" as the parameter.

However, adding settings to ValueFactory seems like a bit of overkill though compared to having users who want this behaviour simply call .toLowerCase().intern() themselves on any language literals they create in their code and rely on BasicParserSettings.NORMALIZE_LANGUAGE_TAG and a future BasicParserSettings.LOWERCASE_LANGUAGE_TAG setting in ParserConfig for all their RDFParser objects.

abrokenjester commented 7 years ago

BCP47 does recommend some case formatting:

The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived.

These conventions include:

o [ISO639-1] recommends that language codes be written in lowercase ('mn' Mongolian).

o [ISO15924] recommends that script codes use lowercase with the initial letter capitalized ('Cyrl' Cyrillic).

o [ISO3166-1] recommends that country codes be capitalized ('MN' Mongolia).

I realize that that is not the same as enforcing it though. Perhaps we should look at a general formatting utility for this kind of thing?

ansell commented 7 years ago

new Locale.Builder().setLanguageTag("en-au").build().toLanguageTag() returns en-AU, so it is already doing some case normalisation, if/when users switch on BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, after the BCP47 pull request is merged.

abrokenjester commented 7 years ago

A low-cost way of adding some convenience for users would be to add a shortcut util function somewhere which executes new Locale.Builder().setLanguageTag(tag).build().toLanguageTag() . Basically just a formatter util function that can be easily called as part of manually creating a language-tagged literal.

Part of the Literals util class perhaps?

ansell commented 7 years ago

A utility function works for me. I will do up a pull request for it.