Ambiguities in parsing Sinhala named sequences (Repaya, Yansaya and Rakaaraansaya)

pathumego commented 5 years ago

There are three Unicode named character sequences for Sinhala, defined in Unicode Standard 6.1.0. Because they lack explicit definitions, OpenType shaping of them is inconsistent. These inconsistencies lead either to actual errors or to users assuming there are errors in the Sinhala Unicode specification, fonts, keyboards and input methods. I am consolidating some of my findings and ideas here.

The three Sinhala named sequences added in Unicode Standard 6.1.0:

SINHALA CONSONANT SIGN YANSAYA; 0DCA 200D 0DBA
SINHALA CONSONANT SIGN RAKAARAANSAYA; 0DCA 200D 0DBB
SINHALA CONSONANT SIGN REPAYA; 0DBB 0DCA 200D

To visualise this:

SINHALA CONSONANT SIGN YANSAYA;      ්‍  + zwj + ය
SINHALA CONSONANT SIGN RAKAARAANSAYA;    ්‍  + zwj + ර
SINHALA CONSONANT SIGN REPAYA;       ර  + ්‍   + zwj
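For anyone who wants to inspect these sequences without depending on how a font renders them, here is a minimal Python sketch. The helper names `to_string` and `describe` are made up for this example; only the standard `unicodedata` module is used.

```python
import unicodedata

# The three named sequences exactly as listed above (hex code points).
NAMED_SEQUENCES = {
    "SINHALA CONSONANT SIGN YANSAYA": "0DCA 200D 0DBA",
    "SINHALA CONSONANT SIGN RAKAARAANSAYA": "0DCA 200D 0DBB",
    "SINHALA CONSONANT SIGN REPAYA": "0DBB 0DCA 200D",
}


def to_string(hex_sequence):
    """Turn a space-separated hex listing into a Python string."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())


def describe(text):
    """List (code point, Unicode character name) for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for name, hex_sequence in NAMED_SEQUENCES.items():
    print(name)
    for code_point, char_name in describe(to_string(hex_sequence)):
        print(f"  {code_point}  {char_name}")
```

Printing code points and character names side by side makes it easier to compare strings that look identical, or broken, on screen.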

These are not included in the Core Specification at the moment and there are ambiguities in how to parse these.

The string 0DBB 0DCA 200D 0DBA ( ර + ්‍ + zwj + ය ) can be parsed in two ways:

1a. 0DBB 0DCA 200D + 0DBA   →  Repaya + Ya (ර්‍ය) 

1b. 0DBB + 0DCA 200D 0DBA   →  Ra+ Yansaya (ර ්‍ය*)

NOTE:* Added space (0020) between ර and ්‍ for demonstration.

Similarly,

The string 0DBB 0DCA 200D 0DBB (ර + ්‍ + zwj + ර) could be parsed as either:

2a. 0DBB 0DCA 200D + 0DBB   →  Repaya + Ra (ර්‍ර)

2b. 0DBB + 0DCA 200D 0DBB   →  Ra + Rakaaraansaya  (ර‌්‍ර)

NOTE:** Added space (0020) between ර and ්‍ for demonstration.

NOTE: The syllable r-ra is not a common occurrence in Sinhala.
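The ambiguity can be reproduced mechanically. The sketch below enumerates every way of splitting a code point string into the three named sequences plus standalone letters. The function `all_parses` and the labels `RA` and `YA` are illustrative names, not part of any specification or shaping engine.

```python
# Named sequences from Unicode 6.1.0, as tuples of code points.
NAMED_SEQUENCES = {
    "YANSAYA": (0x0DCA, 0x200D, 0x0DBA),
    "RAKAARAANSAYA": (0x0DCA, 0x200D, 0x0DBB),
    "REPAYA": (0x0DBB, 0x0DCA, 0x200D),
}

# Standalone letters that appear in the examples (labels are illustrative).
SINGLE = {0x0DBA: "YA", 0x0DBB: "RA"}


def all_parses(cps):
    """Return every split of cps into named sequences and single letters."""
    cps = tuple(cps)
    if not cps:
        return [[]]
    results = []
    # Try to consume a named sequence at the start of the string.
    for name, seq in NAMED_SEQUENCES.items():
        if cps[:len(seq)] == seq:
            for rest in all_parses(cps[len(seq):]):
                results.append([name] + rest)
    # Or consume a single letter and parse the remainder.
    if cps[0] in SINGLE:
        for rest in all_parses(cps[1:]):
            results.append([SINGLE[cps[0]]] + rest)
    return results


print(all_parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBA]))
print(all_parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBB]))
```

Each call returns two parses, corresponding to 1a/1b and 2a/2b respectively, which is the ambiguity described above.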

There is no explicit description of how these two strings should be dealt with in either the named sequences proposal or the SLS 1134:2011 (2011 revision) specification.

However, SLS 1134:2011, Section 5.9 (p. 22), on Repaya has the following explanation:

NOTE: Screenshot from the PDF to avoid Sinhala string display errors

3. 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 
( ර    + ්‍  + zwj + ය + ්‍ + zwj + ය  )

The wording and the example do not resolve the ambiguity explicitly, because they refer to another special case with the visual form ‘yansaya with a repaya’.

We could take SLS 1134:2011 Section 5.9 to imply that 0DBB 0DCA 200D 0DBA should be parsed as 1a and, since Ra+Yansaya is linguistically incorrect, make 1a the default parse. However, if we update the specification to make 1a the default parse, it raises another issue: how do we encode the Ra+Yansaya combination? We still need to be able to display ‘things that should not exist’ or ‘incorrect strings’ in linguistic or technical contexts.

When it comes to 0DBB 0DCA 200D 0DBB (ර + ්‍ + zwj + ර), there are no references in SLS 1134. However, 2b might be preferable to 2a as the default parse. R-Ra and Ra-Ra (i.e. ක්‍රර) are not practically common syllables, and neither (ර්‍ර) nor (්‍රර) is a common occurrence. But that is up to the linguists to decide.

The following is a possible solution:

A. Define the default parse of 0DBB 0DCA 200D 0DBA ( ර + ්‍ + zwj + ය ) as Repaya + Ya (1a above).

B. Define a way to encode Ra+Yansaya.

C. Define a default parse for 0DBB 0DCA 200D 0DBB (ර + ්‍ + zwj + ර) and use the same strategy as B to encode the other possibility. Which one is the default is a question that we can ask the Local Languages Working Group of Sri Lanka.

D. Update HarfBuzz to the new spec.

E. Update all fonts to the new spec.

Related links and references

Proposal for Sinhala named character sequences
SLS 1134
Discussions on Harfbuzz issue tracker

lianghai commented 5 years ago

Directly defining a set of prioritized parsing rules can be helpful to eliminate the issue of implementations (because the major implementation today, OpenType, is based on prioritized parsing stages) having to interpret implied logic for marginal cases. (And the easier-to-understand version like “default parse of …” can always be supplied together.)
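As a rough illustration of what prioritized parsing stages could look like, the sketch below applies one substitution pass per named sequence, in a fixed order, over the whole string. This is only an assumption about how such rules might be written down; `staged_parse` and the stage lists are hypothetical, and this is not how HarfBuzz or any OpenType font actually implements shaping.

```python
REPAYA = ("REPAYA", (0x0DBB, 0x0DCA, 0x200D))
YANSAYA = ("YANSAYA", (0x0DCA, 0x200D, 0x0DBA))
RAKAARAANSAYA = ("RAKAARAANSAYA", (0x0DCA, 0x200D, 0x0DBB))


def staged_parse(cps, stages):
    """Run each (name, sequence) stage over the whole token list, in order."""
    tokens = list(cps)
    for name, seq in stages:
        out, i, n = [], 0, len(seq)
        while i < len(tokens):
            if tuple(tokens[i:i + n]) == seq:
                out.append(name)       # replace the matched run by its name
                i += n
            else:
                out.append(tokens[i])  # keep unmatched tokens unchanged
                i += 1
        tokens = out
    return tokens


ra_ya = [0x0DBB, 0x0DCA, 0x200D, 0x0DBA]

# Repaya stage first: the string resolves to parse 1a.
print(staged_parse(ra_ya, [REPAYA, YANSAYA, RAKAARAANSAYA]))

# Yansaya stage first: the same string resolves to parse 1b.
print(staged_parse(ra_ya, [YANSAYA, RAKAARAANSAYA, REPAYA]))
```

With the Repaya stage first, string 3 from the original post (0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA) comes out as Repaya, Ya, Yansaya in this sketch. Once the stage order is fixed, the result for marginal cases follows from the order itself, with no implied logic left to interpret.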

2b. 0DBB + 0DCA 200D 0DBB → Rakaaraansaya + Ra (්‍රර)

A typo of swapping. Should be “Ra + Rakaaraansaya”.

pathumego commented 5 years ago

A typo of swapping. Should be “Ra + Rakaaraansaya”.

Fixed

n8willis commented 3 years ago

I don't want to create unnecessary noise, but I am interested in tracking the progress on this topic. It looks like the proposed fixes haven't (yet) appeared in Unicode or the MS docs themselves. But is that even intended to be the path "forward"? Would getting positive feedback from the Local Languages Working Group be sufficient to say "this is how it should be?"

pathumego commented 3 years ago

@n8willis I am working on updating the SLS 1134 to reflect the changes and working with @lianghai on updating Unicode docs. Will post the progress here.

pathumego commented 3 years ago

@lianghai Is using ZWNJ to encode 1b above a sensible solution? The Unicode string to display Ra+Yansaya would then be as follows:

4. 0DBB ra, 200C zwnj,  0DCA al-lakuna, 200D zwj, 0DBA ya   →  Ra+ Yansaya (ර ්‍ය*)

Does this mean enabling the string 200C zwnj, 0DCA al-lakuna, 200D zwj, 0DBA ya to be parsed as an independent Yansaya, allowing it to be shaped even where it is orthographically incorrect?
If so, the string 0DB4 pa, 0DCA al-lakuna + 200C zwnj, 0DCA al-lakuna, 200D zwj, 0DBA ya (ප්්‍ය) will shape as 6 instead of 5 (the current shaping), even though 6 is orthographically incorrect.
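To make the effect of the ZWNJ on matching concrete, here is a small sketch that only does substring checks at the code point level. It says nothing about how a shaper would or should treat ZWNJ; the helper `contains` and the variable names are made up for this example.

```python
ZWJ, ZWNJ = "\u200d", "\u200c"
RA, YA, PA, AL_LAKUNA = "\u0dbb", "\u0dba", "\u0db4", "\u0dca"

REPAYA = RA + AL_LAKUNA + ZWJ
YANSAYA = AL_LAKUNA + ZWJ + YA

ambiguous = RA + AL_LAKUNA + ZWJ + YA                   # string from 1a/1b
proposed = RA + ZWNJ + AL_LAKUNA + ZWJ + YA             # string 4
pa_case = PA + AL_LAKUNA + ZWNJ + AL_LAKUNA + ZWJ + YA  # pa example


def contains(text):
    """Report which named sequences occur in text as plain substrings."""
    return {"repaya": REPAYA in text, "yansaya": YANSAYA in text}


for label, text in [("ambiguous", ambiguous),
                    ("proposed", proposed),
                    ("pa_case", pa_case)]:
    print(label, contains(text))
```

In the original string both sequences match, which is the ambiguity. With the ZWNJ inserted, Ra and al-lakuna are no longer adjacent, so only the Yansaya sequence still matches. The same holds for the pa example, which is why the question of an independent Yansaya arises.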

akuru / sinhala-unicode-technical

Ambiguities in parsing Sinhala named sequences (Repaya, Yansaya and Rakaaraansaya) #6
