Closed abrokenjester closed 4 years ago
Although the "value space" is always lowercase in both RDF-1.0 and RDF-1.1, there are users who want to preserve their preferred casing. In order to allow users to setup their preferred behaviour, I think the LanguageHandler, set using ParserConfig is the most appropriate location.
To be clear, RDF-1.0 does not require the actual language tags to be lowercased. It merely specifies that the value space, which is only used for comparisons, not concrete formats, is lowercase. We can store the language tags in whatever format necessary to roundtrip concrete formats in the way that users want, as long as we do comparisons using lowercase versions of the language tags. As we have had (valid) complaints in the past about the previous behaviour of always lowercasing, I don't think it is appropriate to go back to that.
To also be clear, we are not violating RFC3066 for RDF-1.0 documents by switching to using BCP47. BCP47 has grandfathered rules for cases that RFC3066 allowed, but are not otherwise valid in BCP47. Ie, it is a full replacement and there are no cases where we could be compliant with only RDF-1.1 as implied in the comment above.
So, to facilitate all these scenarios, I'm thinking we should look at a mechanism to "inject" a LanguageHandler into a ValueFactory. That way it's user-configurable, and if we adapt the parsers to use this injection mechanism, it also prevents duplicate effort and/or the ValueFactory overriding the parser's normalization handling.
The parsers already have an injection mechanism if they support BasicParserSettings.LANGUAGE_HANDLERS
, which all of our local parsers and any third-party parsers sitting on AbstractRDFParser
do.
I have not so far added lowercase normalization to the BCP47LanguageHandler
in pull request #666. It currently round-trips through Locale.Builder
which does some tag order normalisation but does not do case normalisation as far as I know, although I haven't added tests for that so far. BCP47 is very clear that it is completely case-insensitive, so it doesn't comment on normalisation of case other than to say that users sometimes prefer particular casings. For example, "en-AU" is preferred by some users rather than the less (if ever, in my recollection) used "en-au".
It may be useful to add another ParserSetting
and/or LanguageHandler
parameter to explicitly choose full lowercase normalisation.
The current LanguageHandler
implementation takes a ValueFactory
as a parameter, so it wouldn't be too difficult to inject it into a particular ValueFactory
and call it in those cases with "this" as the parameter.
However, adding settings to ValueFactory
seems like a bit of overkill though compared to having users who want this behaviour simply call .toLowerCase().intern()
themselves on any language literals they create in their code and rely on BasicParserSettings.NORMALIZE_LANGUAGE_TAG
and a future BasicParserSettings.LOWERCASE_LANGUAGE_TAG
setting in ParserConfig
for all their RDFParser
objects.
BCP47 does recommend some case formatting:
The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived.
These conventions include:
o [ISO639-1] recommends that language codes be written in lowercase ('mn' Mongolian).
o [ISO15924] recommends that script codes use lowercase with the initial letter capitalized ('Cyrl' Cyrillic).
o [ISO3166-1] recommends that country codes be capitalized ('MN' Mongolia).
I realize that that is not the same as enforcing it though. Perhaps we should look at a general formatting utility for this kind of thing?
new Locale.Builder().setLanguageTag("en-au").build().toLanguageTag()
returns en-AU
, so it is already doing some case normalisation, if/when users switch on BasicParserSettings.NORMALIZE_LANGUAGE_TAGS, after the BCP47 pull request is merged.
A low-cost way of adding some convenience for users would be to add a shortcut util function somewhere which executes new Locale.Builder().setLanguageTag(tag).build().toLanguageTag()
. Basically just a formatter util function that can be easily called as part of manually creating a language-tagged literal.
Part of the Literals
util class perhaps?
A utility function works for me. I will do up a pull request for it.
Followup issue of #665 .
Currently, constructing a new
Literal
object using a language tag just takes the input string for the language tag as-is. However, language tags are considered case-insensitive. Quoting @nikolavp in discussion on issue #665:A problem to be solved is how to distinguish literals created by a parser (which already has its own language tag normalization) from literals created directly by user code (or by SPARQL updates): we don't want to normalize twice (performance penalty), and we also don't want to "override" whatever normalization rules the parser applies - so it's not simply a matter of doing
toLowerCase()
in theSimpleLiteral
constructor.