dhanika / sett-browser

Automatically exported from code.google.com/p/sett-browser

Bugs found with Tamil range in SETT browser #4

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Firstly I congratulate Dhanika Perera for this fine creation that is going to 
be very useful till Android developers provide us full and seamless system-wide 
support for complex scripts rendering.

There are a few bugs I have found for Tamil rendered by SETT with 
DhanikaSETT.ttf. My tests were during this past weekend 22,23-Jan-2011. 

The SETT browser version used is the current 1.1.1 installed to Android 
Emulator in Android SDK using Android Platform 2.3 (API Level 9) in my Ubuntu 
10.10 (Maverick) Desktop computer. (I do not have an Android cellphone so far 
but for this testing the emulator on PC desktop is very much sufficient). 

I also svn checked out the source for SETT to look at the DhanikaSETT.ttf font 
by opening it under fontforge to see how and where ligatures are placed for 
mapping of code point sequences that need complex rendering in normal Unicode 
implementation.  

I checked for the entire set of Tamil characters & syllables as well as 
symbols, digits and numbers in Tamil Unicode range by looking at my tabulation 
of them via the SETT Browser in Android emulator. My tabulation as a website is 
at the following URL:

url: http://sites.google.com/site/tamilincomp/ta_sy
short url: http://is.gd/v7Jloi

As I mention on that page, to view it in a usual Unicode-enabled desktop 
browser it is better to have the page rendered with the Lohit Tamil font, 
since it is comprehensive in its coverage of the Tamil range. If you have a 
version older than 2.4.4, it is better to download the current version 2.4.5 
(from the fedorahosted.org site) and use it.

If you wish to, for reference you can download a pdf version for the above page 
from my site: 

url: http://sites.google.com/site/tamilincomp/files/tam-sy.pdf
short url: http://is.gd/lCzc62

Now in this message I will point out the first 5 bugs with two screen shots and 
make my suggestions for solutions. I will continue later during next few days 
with the remaining bugs (about 5 more) with screenshots. 

Each screen shot attached to this mail has parts of my tabulation in above 
mentioned web site viewed via Firefox browser (in Ubuntu-10.10) on left side 
and viewed via SETT browser in Android emulator on right side.

Bug 1 - See attached Screen shot file: bugs_1-3.png
-------------------------------------------------------------------
In column 2 (consonant ங - U+0B99), rows 5 and 6, the two syllables formed of 
ங with the u and uu vowel sounds, namely {U+0B99 U+0BC1} and {U+0B99 U+0BC2}, 
do not show the correct ligatures ஙு and ஙூ in SETT browser. Instead they are 
wrongly mapped to the ligature ழு, which is at U+0BAD in your font, and the 
ligature ழூ, which is at U+0B96, respectively.

Looking at the font in fontforge, I see that you need to add a ligature for 
ஙு in a vacant code-point for mapping {U+0B99 U+0BC1} to it and then from 
it to generate ஙூ  {U+0B99 U+0BC2} analogous to the pairs of பு,பூ 
(column 9 , rows 5 & 6), யு,யூ (column 11 , rows 5 & 6) and வு, 
வூ (column 14 , rows 5 & 6)

Bug 2 - See attached Screen shot file: bugs_1-3.png
-------------------------------------------------------------------
Column 3, Row 6 - For சூ  {U+0B9A U+0BC2} in SETT browser the mapping is to 
சு஗ that is code points sequence {U+0B9A U+0BC1 U+0B97 } - But in the 
font you already have the correct ligature ஋ at U+0B8B. 

So correcting the conversion mapping to U+0B8B instead of {U+0B9A U+0BC1 U+0B97 
} should rectify this bug

Bug 3 - See attached Screen shot file: bugs_1-3.png
-------------------------------------------------------------------
Column 4, rows 5 and 6 respectively - for ஞு {U+0B9E U+0BC1} and ஞூ {U+0B9E 
U+0BC2} the SETT browser shows the wrong ligatures ழு, ழூ respectively. 

Here also you need to add a ligature for ஞு in a vacant code point for 
mapping {U+0B9E U+0BC1} to it and then from it to generate ஞூ {U+0B9E 
U+0BC2}, analogous to the pairs of ணு,ணூ (column 6, rows 5 & 6), 
து,தூ (column 7, rows 5 & 6) etc. (there are a few more analogous pairs)

Bug 4 - See attached Screen shot file: bugs_4-5.png
-------------------------------------------------------------------

Column 10 row 6 - For மூ {U+0BAE U+0BC2} the mapping in SETT browser is to 
the wrong ligature, ழூ at code-point U+0B96. But you have the correct 
ligature for மூ at the code-point U+0B8D. So the mapping of {U+0BAE U+0BC2} 
needs to be corrected to U+0B8D.

Bug 5 - See attached Screen shot file: bugs_4-5.png
-------------------------------------------------------------------
Column 16 row 6 - For ளூ {U+0BB3 U+0BC2} mapping in SETT browser is to 
ளு௓ {U+0BB3 U+0BC1 U+0BD3} which is not correct. I see that you need to 
add a ligature for ளூ  in a vacant code point for mapping {U+0BB3 U+0BC2} 
to it.
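
The two corrections that need no new glyph (Bug 2 and Bug 4) can be sketched as a lookup table. This is only an illustration in Python, not SETT's actual conversion code: the names `LIGATURE_SLOTS`, `remap` and `show` are made up here, and only the slots named in this report (U+0B8B for சூ, U+0B8D for மூ) are filled in. Bugs 1, 3 and 5 need new glyphs in slots that are yet to be chosen, so they are left out.

```python
# Hypothetical sketch: SETT's real mapping code is not shown in this thread.
# Keys are Unicode code point sequences, values are the font slots that the
# report says already hold the correct ligature glyphs.
LIGATURE_SLOTS = {
    "\u0B9A\u0BC2": "\u0B8B",  # சூ -> slot U+0B8B (Bug 2)
    "\u0BAE\u0BC2": "\u0B8D",  # மூ -> slot U+0B8D (Bug 4)
}


def remap(text, table=LIGATURE_SLOTS):
    """Replace each known sequence with its font slot, longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No sequence starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def show(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)
```

Trying longer keys first matters once three-code-point sequences are added to the same table, since a shorter key could otherwise consume the start of a longer one.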

Will continue soon during the next couple of days

K. Sethu
Colombo

Original issue reported on code.google.com by skhome@gmail.com on 23 Jan 2011 at 8:02

GoogleCodeExporter commented 9 years ago
Just noticed that in my report numbering the bugs  as Bug 1, Bug 2 etc was not 
a good idea because they actually conflict with older bug reports.  My Bug 1 
and Bug 2 are shown with blue over-strike which is perhaps used for indicating 
solved.  So we need to fix this. Can you suggest / implement a way out?

K. Sethu

Original comment by skhome@gmail.com on 23 Jan 2011 at 8:10

GoogleCodeExporter commented 9 years ago
Hereafter I will denote the bugs as 6th bug, 7th bug... etc to avoid the 
problem I mentioned in comment #2 above

There are two screen-shot attachments to this message.

6th bug - See attached Screen shot file: bugs_6-7.png
-------------------------------------------------------------------
Column 19 - The addition of the glyph for consonant character ஶ to U+0BB6 is 
required 

7th bug - See attached Screen shot file: bugs_6-7.png
-------------------------------------------------------------------
Column 24 : on க்ஷ and க்‌ஷ

The code points sequence for conjunct ligature of KSSA is {U+0B95 U+0BCD U+0BB7}
The code points sequence for split form of KSSA is {U+0B95 U+0BCD U+200C U+0BB7}

In my tabulation in column 24 conjunct form is used and in column 25 it is 
split form having the ZWNJ (U+200C) in between U+0BCD and U+0BB7. 

As it is seen in the screen shot in SETT browser both forms appear split. 

You need to add the ligature க்ஷ for the conjunct form in a vacant slot 
and map the sequence U+0B95 U+0BCD U+0BB7 to it. Also, after such change, it 
should be assured that the sequence {U+0B95 U+0BCD U+200C U+0BB7} would 
continue to be mapped to the split form.
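
The requirement that the ZWNJ form stays split can be illustrated with a small Python sketch. It is not taken from SETT's code: `map_kssa` is an illustrative name, and `KSSA_SLOT` is a Private Use placeholder standing in for whichever vacant slot is eventually chosen.

```python
ZWNJ = "\u200C"

KSSA_CONJUNCT = "\u0B95\u0BCD\u0BB7"         # KA + PULLI + SSA
KSSA_SPLIT = "\u0B95\u0BCD" + ZWNJ + "\u0BB7"  # KA + PULLI + ZWNJ + SSA

# Placeholder only: the real vacant slot is not decided in this thread.
KSSA_SLOT = "\uE000"


def map_kssa(text):
    """Map the conjunct sequence to the ligature slot.

    The split form is left alone because the ZWNJ sits between PULLI and
    SSA, so the three-code-point key never occurs contiguously in it.
    """
    return text.replace(KSSA_CONJUNCT, KSSA_SLOT)
```

With this approach no special case is needed for the split form; it simply never matches the conjunct key.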

8th bug - See attached Screen shot file: bugs_8-10.png
-------------------------------------------------------------------
see under "Ligature for Srii/Shrii" (ஶ்ரீ / ஸ்ரீ ) 

The Grantha script equivalent of Sri is a ligature for a sequence of Unicode 
code points. The sequence was changed from Unicode version 4.1 onwards as 
follows:

Before Unicode version 4.1 it was {U+0BB8 U+0BCD U+0BB0 U+0BC0}; from version 
4.1 onwards the current standard definition is {U+0BB6 U+0BCD U+0BB0 U+0BC0}.

The old ligature definition has not been deprecated in fonts. Most fonts still 
haven't included the current definition. Lohit Tamil, recent MS Latha & Arial 
Unicode MS and a few others have included the current definition, but they 
also retain the old one for backward compatibility. 

Further, although Sri Lanka's SLS standard for Tamil and Tamil Nadu Govt's 
Unicode implementation standard specify that the current standard be used in 
key-maps,  there are plenty of key-maps which have not made the switch.

So I recommend that for the present the older definition  {U+0BB8 U+0BCD U+0BB0 
U+0BC0} mapping to the Sri ligature be continued and additionally include the 
current standard {U+0BB6 U+0BCD U+0BB0 U+0BC0} mapping to the same ligature. 
(note that even if the 6th bug mentioned above is not rectified, this addition 
can be made)
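
A minimal Python sketch of this recommendation, with both sequences pointing at one ligature. The names `SRI_TABLE` and `map_sri` are illustrative, and `SRI_SLOT` is a Private Use placeholder, not the slot actually used in DhanikaSETT.ttf.

```python
SRI_OLD = "\u0BB8\u0BCD\u0BB0\u0BC0"      # SA + PULLI + RA + II (before 4.1)
SRI_CURRENT = "\u0BB6\u0BCD\u0BB0\u0BC0"  # SHA + PULLI + RA + II (4.1 onwards)

# Placeholder only: stands in for the font slot holding the Sri ligature.
SRI_SLOT = "\uE001"

# Both definitions map to the same ligature glyph.
SRI_TABLE = {SRI_OLD: SRI_SLOT, SRI_CURRENT: SRI_SLOT}


def map_sri(text):
    """Replace either Sri sequence with the single ligature slot."""
    for sequence, slot in SRI_TABLE.items():
        text = text.replace(sequence, slot)
    return text
```

Because the whole four-code-point sequence is replaced in one step, the standalone glyph for U+0BB6 is never looked up, which is presumably why this addition does not depend on the 6th bug being fixed.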

9th bug - See attached Screen shot file: bugs_8-10.png
-------------------------------------------------------------------
There are 9 Tamil symbols to be added to DhanikaSETT.ttf in their respective 
Unicode code point slots. They are:

ௐ (U+0BD0), ௳ (U+0BF3), ௴ (U+0BF4), ௵ (U+0BF5), ௶ (U+0BF6), ௷ (U+0BF7), 
௸ (U+0BF8), ௹ (U+0BF9), ௺ (U+0BFA)

10th bug - See attached Screen shot file: bugs_8-10.png
-------------------------------------------------------------------
All the Tamil digits and numbers are covered by SETT browser except for the 
Tamil digit zero at Unicode code point U+0BE6 and so it has to be added.
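
One way to recheck the 9th and 10th bugs after a font update is to compare the font's covered code points against this list. A minimal Python sketch, assuming the set of covered code points has already been read from DhanikaSETT.ttf by some other tool (that step is not shown, and `REQUIRED`, `missing` and `describe` are illustrative names):

```python
import unicodedata

# The nine symbols from the 9th bug plus the digit zero from the 10th bug.
REQUIRED = [0x0BD0, *range(0x0BF3, 0x0BFB), 0x0BE6]


def missing(covered):
    """Return the required code points absent from the covered set."""
    return [cp for cp in REQUIRED if cp not in covered]


def describe(cp):
    """Format a code point as 'U+XXXX NAME' using Python's Unicode database."""
    name = unicodedata.name(chr(cp), "<no name in this Unicode database>")
    return f"U+{cp:04X} {name}"


if __name__ == "__main__":
    # An empty set simulates a font that covers none of the required slots.
    for cp in missing(set()):
        print(describe(cp))
```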

Hope my reports here are sufficient and clear for further actions.

K. Sethu

Original comment by skhome@gmail.com on 25 Jan 2011 at 5:20

GoogleCodeExporter commented 9 years ago
Thanks a lot to K. Sethu for reporting these Tamil rendering issues clearly in 
detail.

First I have to mention that I have only a little knowledge of the Tamil 
language since I'm Sinhalese. I can only read & write Tamil, therefore I 
needed this browser to be tested by a Tamil language specialist from the 
beginning of the project. It seems that you have studied my work well & have 
a good knowledge of what I've done. I thank you again for your contribution 
on this.

To implement the Tamil rendering support in SETT Browser I relied on an 
existing mapping algorithm that maps characters between the 'Latha' Unicode 
font & the 'Bamini' legacy font.

Go to this url & view the source of the page.
http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_u-b.xml

And to create the DhanikaSETT.ttf custom font, I used the Tamil glyphs from 
the 'Bamini' Tamil font. Since I used those 2 existing resources, I assumed 
that they cover all the ligatures in Tamil Unicode. But from your report I 
understand that the 'Bamini' font does not have all the glyphs required to 
map all the ligatures in Tamil Unicode, and that the existing mapping does 
not cover all the Tamil ligatures either.

To fix all these rendering issues I need several things from a Tamil language 
specialist. Hope you will help me for this.

1. A 3-column table consisting of the following columns.

   * Column1: Missing Tamil ligatures in SETT Browser (No need to mention the Unicode value, just type it)
   * Column2: The sequence of symbols in a Tamil legacy font (eg. Bamini) to map that particular ligature
   * Column3: Whether all the required symbols for the particular ligature already exist in the DhanikaSETT.ttf font, or which ones are to be added

Note: Preparing a table like this will make fixing these bugs easier. You 
can prepare an Excel table with these columns & attach it to this issue thread.

2. If you too accept that 'Bamini' font doesn't have all the symbols required 
to map all the Tamil ligatures, please suggest me a better legacy font (not a 
Unicode font) which consists of all the required symbols.

Note: If you suggest an alternative font, please use that font for Column2 
of the table in No. 1 above. Otherwise you can use the 'Bamini' font for 
that.

I will start working on this issue once I receive the missing ligature table 
from you. You can add additional details to that table if you need to explain 
something.

Looking forward to the table from you.

Thanks again!
Dhanika Perera

Original comment by dhanikap...@gmail.com on 25 Jan 2011 at 9:46

GoogleCodeExporter commented 9 years ago
Dhanika >> //2. If you too accept that 'Bamini' font doesn't have all the 
symbols required to map all the Tamil ligatures, please suggest me a better 
legacy font (not a Unicode font) which consists of all the required symbols.//

Why not a Unicode font which is GPLed? You only need to cull the missing glyphs from it.

K. Sethu

Original comment by skhome@gmail.com on 25 Jan 2011 at 11:19

GoogleCodeExporter commented 9 years ago
A legacy font instead of a Unicode font has been used to extract the glyphs 
for the following reasons:
  * When the symbols (parts of ligatures) from a legacy font are used, a limited number of font glyphs can be used to map all the ligatures.
  * If the composite glyphs from a Unicode font are used instead, a large number of glyphs have to be added to the DhanikaSETT.ttf font to represent all the ligatures uniquely. The size of the font file will then increase & that will cause problems when the font gets automatically downloaded to the browser.
  * Free slots from several non-Tamil script ranges would also have to be used to place all those glyphs & that will be problematic when the SETT Browser extends its language support to other complex scripts in future.
  * It would also increase the lines of code in the mapping algorithm & that would slow down the rendering.

Therefore I prefer a GPLed legacy font. I would be glad if you could suggest 
such a font to me, one having all the required glyphs. Thanks! 
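
The glyph-count argument can be made concrete with a rough upper bound. The Python sketch below is an estimate under stated assumptions, not a count taken from any font: it supposes every consonant plus vowel-sign (or pulli) pair in the Tamil block were given its own precomposed glyph, and compares that with storing the consonants and signs as separate parts. In practice only some pairs need true ligatures, so the real difference would be smaller. The helper name `tamil_range` is illustrative.

```python
import unicodedata


def tamil_range(start, end, categories):
    """List assigned characters in [start, end] with a given general category."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) in categories
    ]


# Consonant letters (category Lo) in U+0B95..U+0BB9.
consonants = tamil_range(0x0B95, 0x0BB9, {"Lo"})
# Dependent vowel signs and pulli (categories Mc/Mn) in U+0BBE..U+0BCD.
signs = tamil_range(0x0BBE, 0x0BCD, {"Mc", "Mn"})

# Upper bound if every pair had its own glyph, versus storing parts only.
precomposed_upper_bound = len(consonants) * len(signs)
parts_only = len(consonants) + len(signs)

if __name__ == "__main__":
    print("one glyph per pair (upper bound):", precomposed_upper_bound)
    print("separate parts:", parts_only)
```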

Original comment by dhanikap...@gmail.com on 26 Jan 2011 at 4:16

GoogleCodeExporter commented 9 years ago
After thorough research on this issue, it was decided not to fix it, since 
the reported unsupported characters are used only in advanced Tamil text, and 
this is a mobile web browser that is assumed not to be meant for reading such 
text. All the Tamil characters used in normal Tamil text are already 
available in this browser & therefore normal users will not be affected by 
this issue. 

Original comment by dhanikap...@gmail.com on 21 Mar 2011 at 4:50