Schematron / schematron

Schematron "skeleton" - XSLT implementation
MIT License

Localization concept needs improvement #40

Open tofi86 opened 7 years ago

tofi86 commented 7 years ago

Hey,

after attending the first ever Schematron Users Meetup at XML Prague this year, I'm thrilled to see that schematron is coming back to life — thanks @rjelliffe, @AndrewSales and @tgraham-antenna for your work!

As a contributor to the EpubCheck project (EPUB validation) and the SQF Schematron QuickFix project, I'd like to open up this issue and start a discussion about improvements to the Schematron localization concepts — or at least for the Skeleton implementation.

The EpubCheck project uses Java properties files for localization, but it also has several Schematron checks that cannot be localized at the moment, because the official Skeleton implementation used by the Jing validator does not support this. This has been under discussion since October 2014 in https://github.com/IDPF/epubcheck/issues/474

And more recently, the SQF project struggled with this as well in https://github.com/schematron-quickfix/sqf/issues/1.

Annex G of the ISO Schematron specification defines the use of multilingual Schematron as follows:

Diagnostics in multiple languages may be supported by using a different diagnostic element for each language, with the appropriate xml:lang language attribute, and referencing all the unique identifiers of the diagnostic elements in the diagnostics attribute of the assertion. Annex G gives a simple example of a multi-lingual schema.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1 d2">A dog should have a bone.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1" xml:lang="en">A dog should have a bone.</sch:diagnostic>
        <sch:diagnostic id="d2" xml:lang="de">Ein Hund sollte ein Bein haben.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>

However, this never worked in the original Skeleton implementation, as it would display both messages and not only the one from the current locale.

oXygen XML has implemented a workaround for this issue by tweaking the original Skeleton implementation so that it only shows the message for the current locale. Perhaps they could contribute this change as a pull request.

However, there's another shortcoming of the diagnostic-based localization concept: the developer has to explicitly reference every language with a separate ID in the diagnostics attribute, which makes it hard to add new localizations.

At XML Prague, Octavian from oXygen XML (@octavianN), Nico from the SQF project (@nkutsche), Patrik (@PStellmann) & Vanessa (@vanessakastmann) from the DITA-SEMIA project and I sat together to discuss the SQF issue https://github.com/schematron-quickfix/sqf/issues/1. We quickly came to the conclusion that the localization support in the Schematron standard or the Skeleton implementation needs to be improved in order to properly resolve issues like the EpubCheck or SQF one.

We discussed the following solutions, which I want to outline here as a basis for discussion. You should also know that we discussed this with the use case of externalizing the messages to separate files (e.g. for Translation Memory Systems) in mind.

Solution 1: Fix the Skeleton

The Skeleton should be fixed to at least support the Annex G example properly: Only output the message in the current locale and not ALL diagnostic elements.
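
As a rough illustration of the intended behaviour (not the Skeleton's actual XSLT), the following Python sketch filters the diagnostics referenced by an assertion down to the ones matching the current locale. The function name `localized_diagnostics` and the way the locale is passed in are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

SCH = "{http://purl.oclc.org/dsdl/schematron}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The Annex G example from above, shortened.
SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:pattern>
    <sch:rule context="dog">
      <sch:assert test="bone" diagnostics="d1 d2">A dog should have a bone.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">A dog should have a bone.</sch:diagnostic>
    <sch:diagnostic id="d2" xml:lang="de">Ein Hund sollte ein Bein haben.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
"""


def localized_diagnostics(schema, assertion, locale):
    """Return the texts of the referenced diagnostics whose language equals locale."""
    default_lang = schema.get(XML_LANG)
    by_id = {d.get("id"): d for d in schema.iter(SCH + "diagnostic")}
    texts = []
    for ref in assertion.get("diagnostics", "").split():
        diag = by_id[ref]
        # xml:lang is inherited, so fall back to the language of the schema root.
        if diag.get(XML_LANG, default_lang) == locale:
            texts.append(diag.text.strip())
    return texts


schema = ET.fromstring(SCHEMA)
assertion = schema.find(f".//{SCH}assert")
print(localized_diagnostics(schema, assertion, "de"))
```

With the locale set to "de", only the German diagnostic is returned instead of both messages.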

Solution 2: Remove ID/IDREF constraint from Schematron schema

This is more like a long-term solution as the standardized schema would need to be changed.

What we would like to achieve is something like this:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1" xml:lang="en">English message.</sch:diagnostic>
        <sch:diagnostic id="d1" xml:lang="de">German message.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
  1. Only reference the message ID (which isn't of datatype ID anymore) once and let the Skeleton or any other implementation choose the proper diagnostic element.
  2. Schematron rule: Enforce the xml:lang attribute with different values when two or more diagnostic elements with the same id are present.

Current status: This does not validate because of the ID/IDREF datatypes.
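
To make the two points above concrete, here is a minimal Python sketch, assuming the ID/IDREF constraint has been dropped. It indexes diagnostics by the pair (id, xml:lang), rejects two diagnostics that share both values, and falls back to the assertion text when no localization exists. The helper names `index_messages` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

SCH = "{http://purl.oclc.org/dsdl/schematron}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DIAGNOSTICS = """\
<sch:diagnostics xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:diagnostic id="d1" xml:lang="en">English message.</sch:diagnostic>
  <sch:diagnostic id="d1" xml:lang="de">German message.</sch:diagnostic>
</sch:diagnostics>
"""


def index_messages(diagnostics):
    """Map (id, xml:lang) to the message text; reject ambiguous pairs (point 2)."""
    index = {}
    for diag in diagnostics.iter(SCH + "diagnostic"):
        key = (diag.get("id"), diag.get(XML_LANG))
        if key in index:
            raise ValueError(f"duplicate diagnostic for id/lang {key}")
        index[key] = diag.text.strip()
    return index


def resolve(index, message_id, locale, fallback):
    """Pick the message for the locale, else the assertion's fallback text (point 1)."""
    return index.get((message_id, locale), fallback)


index = index_messages(ET.fromstring(DIAGNOSTICS))
print(resolve(index, "d1", "de", "(Optional) Fallback message."))
print(resolve(index, "d1", "fr", "(Optional) Fallback message."))
```

A non-validating parser such as the one used here accepts the repeated id, which is why the uniqueness check has to be done explicitly.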

Solution 3a: Do it the Java way (hacky)

In Java you just reference the messages.properties file (by its base name) and the ResourceBundle lookup takes care of resolving the current Locale. In a German environment, for example, Java would automatically look for messages_de.properties, although this file isn't referenced in the Java class.

Schematron could do this as follows:

dog.sch

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:include href="messages.sch"/>
</sch:schema>

messages.sch:

<sch:diagnostics xml:lang="en">
    <sch:diagnostic id="d1">A dog should have a bone.</sch:diagnostic>
</sch:diagnostics>

messages_de.sch:

<sch:diagnostics xml:lang="de">
    <sch:diagnostic id="d1">Ein Hund sollte ein Bein haben.</sch:diagnostic>
</sch:diagnostics>
  1. The Skeleton would need to be changed to look for {include}_{locale}.sch every time it resolves an include.
  2. That's a bit hacky

Current status: dog.sch would validate without errors, but some of our group had reservations because of the misuse of the include element and because the German message file messages_de.sch isn't referenced anywhere within the schema. Personally(!), I could live with the latter, as it's Java style...
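
The {include}_{locale}.sch lookup can be sketched in a few lines of Python. This is a simplified assumption of how an implementation might resolve the file: it only tries one localized name before falling back to the referenced file, whereas Java's real lookup also considers country and variant. The function names are hypothetical.

```python
from pathlib import Path
import tempfile


def localized_candidates(href, locale):
    """List file names to try, most specific first: messages_de.sch, then messages.sch."""
    path = Path(href)
    localized = path.with_name(f"{path.stem}_{locale}{path.suffix}")
    return [localized, path]


def resolve_include(base_dir, href, locale):
    """Return the first candidate that exists under base_dir, or None."""
    for candidate in localized_candidates(href, locale):
        full = Path(base_dir) / candidate
        if full.is_file():
            return full
    return None


with tempfile.TemporaryDirectory() as tmp:
    for name in ("messages.sch", "messages_de.sch"):
        (Path(tmp) / name).write_text("<diagnostics/>", encoding="utf-8")
    print(resolve_include(tmp, "messages.sch", "de").name)  # localized file exists
    print(resolve_include(tmp, "messages.sch", "fr").name)  # falls back to the default
```

The fallback to the unsuffixed file is what keeps dog.sch usable in locales for which no translation has been written yet.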

Solution 3b: Do it the Java way (properly)

To address the issue about misusing the include element from solution 3a, I'd like to introduce either a new element for message file references:

<sch:messages href="messages.sch"/>

which would require the referenced file to have a diagnostics root element

or at least an additional attribute on the include element:

<sch:include href="messages.sch" type="localization"/>

which would advise Skeleton and any other implementation to look for localized files as well (in the Java form of {include}_{locale}.sch).

Solution 4: Work with business rules for the referenced IDs

In my personal opinion this can't be more than a temporary hack, but it was heavily discussed in the group:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1">English message.</sch:diagnostic>
        <sch:diagnostic id="d1_de">German message.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
  1. The Skeleton would need to be changed to look for an ID {id}_{locale} diagnostic element if the current locale does not match xml:lang on the root element.
  2. That's more than hacky

Current status: The schema would validate without errors.
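
A minimal Python sketch of this naming convention, assuming the implementation knows the current locale: if the locale differs from xml:lang on the root element, it first tries the ID {id}_{locale} and otherwise uses the plain ID. The function name `pick_diagnostic` is made up for this example.

```python
import xml.etree.ElementTree as ET

SCH = "{http://purl.oclc.org/dsdl/schematron}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:diagnostics>
    <sch:diagnostic id="d1">English message.</sch:diagnostic>
    <sch:diagnostic id="d1_de">German message.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
"""


def pick_diagnostic(schema, ref, locale):
    """Resolve a diagnostics reference using the {id}_{locale} convention."""
    by_id = {d.get("id"): d for d in schema.iter(SCH + "diagnostic")}
    if locale != schema.get(XML_LANG):
        localized = by_id.get(f"{ref}_{locale}")
        if localized is not None:
            return localized.text.strip()
    # Same language as the schema, or no localized variant available.
    return by_id[ref].text.strip()


schema = ET.fromstring(SCHEMA)
print(pick_diagnostic(schema, "d1", "de"))
print(pick_diagnostic(schema, "d1", "fr"))
```

The IDs stay unique, so the schema remains valid, but the link between a message and its translations exists only in the naming convention.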


I have laid out the different solutions we discussed at our SQF meeting, and the more I think about it, the more I wish we had discussed this two days earlier at the Schematron Users Meetup... Anyway...

This is only meant as a basis for further discussion, and I hope I have made clear why we need improvements to either the standard or the Skeleton.

Kind regards, Tobias

on behalf of Octavian, Nico, Patrik and Vanessa

tofi86 commented 7 years ago

P.S.: I just wrote this down off the top of my head after a long Prague weekend, so I hope I haven't forgotten anything. Octavian, Nico, Patrik, Vanessa, please add to the discussion if I missed something!

rjelliffe commented 7 years ago

Yes, Schematron needs to select the correct language for diagnostics. If there is a bug, I will fix it.

As for combined diagnostic files, I suppose there would also be the approach of making the URL for the include/href (more likely to be the new extends/@href) dynamic: Allow {} like {concat ('file://xxxx/diagnostics_', $lang, '.sch')}

Regards Rick

On 13 Feb 2017 10:13, "Tobias Fischer" notifications@github.com wrote:

P.S.: I just wrote this down off the top of my head after a long Prague weekend, so I hope I haven't forgotten anything. Octavian, Nico, Patrik, Vanessa, please add to the discussion if I missed something!


tofi86 commented 7 years ago

Yes, Schematron needs to select the correct language for diagnostics. If there is a bug, I will fix it.

Yeah, at the moment, the default skeleton is not picking up the xml:lang attribute.

Perhaps Octavian from oXygen XML (@octavianN) would be willing to contribute their fixed version?

I think the hardest part would probably be getting the fixed version into third-party tools like Jing...

Allow {} like {concat ('file://xxxx/diagnostics_', $lang, '.sch')}

That's also a nice way to reference the external language files dynamically.

tgraham-antenna commented 7 years ago

On 17/02/2017 16:01, Tobias Fischer wrote:

Yes, Schematron needs to select the correct language for diagnostics. If there is a bug, I will fix it.

Yeah, at the moment, the default skeleton is not picking up the xml:lang attribute.

Perhaps Octavian from oXygen XML (@octavianN) would be willing to contribute their fixed version?

When I spoke to Octavian at XML Prague, he said that oXygen did it by filtering messages, not by not emitting the message in the first place.

georgebina commented 7 years ago

Just to clarify, Jing internally uses an approach similar to the Skeleton, but it is a different implementation. Also, it only supports pre-ISO Schematron; support for ISO Schematron is very limited. I enabled it just by supporting the new namespace, but nothing is implemented in terms of ISO-specific functionality.

georgebina commented 7 years ago

The oXygen implementation is available under oXygen/frameworks/schematron/impl/ with the same license as the skeleton - it is a fork we made many years ago, so you can surely get whatever update we made back into the skeleton implementation.

octavianN commented 6 years ago

I added a pull request with the multilingual support that we have in oXygen, based on diagnostics. The messages are generated automatically in the language specified by the "langCode" parameter. If there are no messages in that language, all the messages will be generated, each prefixed by its language.
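
The selection rule described here can be paraphrased in Python as follows. This is only a sketch of the described behaviour, not the XSLT from the pull request; the function name and the "[lang] " prefix format are assumptions.

```python
def render_diagnostics(diagnostics, lang_code):
    """Select messages for one failed assertion.

    diagnostics: list of (lang, text) pairs referenced by the assertion.
    lang_code:   the requested language, as in the "langCode" parameter.
    """
    matching = [text for lang, text in diagnostics if lang == lang_code]
    if matching:
        return matching
    # No message in the requested language: emit all of them, prefixed by language.
    return [f"[{lang}] {text}" for lang, text in diagnostics]


messages = [("en", "A dog should have a bone."),
            ("de", "Ein Hund sollte ein Bein haben.")]
print(render_diagnostics(messages, "de"))
print(render_diagnostics(messages, "fr"))
```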

tofi86 commented 6 years ago

PR #63

Awesome, thanks Octavian! 👍 Looking forward to seeing this merged!

dmj commented 5 years ago

Solution 5

Use one diagnostic per message and wrap localizations in a foreign element with @xml:lang.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1">
            <p xmlns="http://www.w3.org/1999/xhtml">English message.</p>
            <p xmlns="http://www.w3.org/1999/xhtml" xml:lang="de">German message.</p>
        </sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
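
How an implementation might pick the right child element can be sketched in Python. This assumes that a child without its own xml:lang inherits the language of the diagnostic or, failing that, of the schema root, and that the default language is used when the requested one is missing. The function name `localized_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

SCH = "{http://purl.oclc.org/dsdl/schematron}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:diagnostics>
    <sch:diagnostic id="d1">
      <p xmlns="http://www.w3.org/1999/xhtml">English message.</p>
      <p xmlns="http://www.w3.org/1999/xhtml" xml:lang="de">German message.</p>
    </sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
"""


def localized_text(schema, diagnostic_id, locale):
    """Return the text of the child element whose language matches locale."""
    for diag in schema.iter(SCH + "diagnostic"):
        if diag.get("id") != diagnostic_id:
            continue
        # Inherited language: the diagnostic's own xml:lang, else the schema's.
        default_lang = diag.get(XML_LANG, schema.get(XML_LANG))
        variants = {
            child.get(XML_LANG, default_lang): "".join(child.itertext()).strip()
            for child in diag
        }
        # Fall back to the default-language variant if the locale is missing.
        return variants.get(locale, variants.get(default_lang))
    return None


schema = ET.fromstring(SCHEMA)
print(localized_text(schema, "d1", "de"))
print(localized_text(schema, "d1", "fr"))
```

Because each message keeps a single ID, the assertion references it once and new languages can be added without touching the diagnostics attribute.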