ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

Increase support for Sanskrit text in Tamil script #99

Closed akprasad closed 5 months ago

akprasad commented 6 months ago

As requested on our Discord, this includes:

@deepestblue Do you know how the anusvara should be rendered in Tamil script? Aksharamukha uses an apostrophe (அம்ʼ), and I hear that U+0B82 should not be used (http://unicode.org/L2/L2012/12018-tamil-anusvara-depr.pdf).

+cc @jamadagni, whom I saw mentioned in an issue in indic-transliteration

akprasad commented 6 months ago

@deepestblue based on #101, how might the visarga be represented in Tamil script, if at all? For reference, Aksharamukha uses U+A789 (MODIFIER LETTER COLON).

deepestblue commented 6 months ago

As you're probably aware, the Grantha script was devised specifically to supplement Tamil with extra characters to accommodate Sanskrit phonemes; in fact (modulo some historical divergence), the common phonemes have near-identical glyphs in Tamil and Grantha, and a mix of Tamil and Grantha glyphs has been used historically to write Manipravalam.

Personally, I think it's not easy to try to support Sanskrit in the Tamil script. If it were up to me, people trying to read Sanskrit in the Tamil script would use Grantha instead :-)

Unfortunately, I don't have a good solution for you, for either anusvara or visarga; I find what Aksharamukha has done a pure hack, with no historical or usage-basis. Perhaps scholars like jamadagni may have other ideas.

akprasad commented 6 months ago

@deepestblue just FYI, Aksharamukha seems to be following the usage described here https://www.unicode.org/L2/L2010/10256r-extended-tamil.pdf , which is referred to in the document as "Extended Tamil conservative version"

deepestblue commented 6 months ago

Hmm ... which section in the document recommends using U+A789 for representing the visarga in Tamil?

akprasad commented 6 months ago

> Hmm ... which section in the document recommends using U+A789 for representing the visarga in Tamil?

Ah, none -- sorry for the confusion. My comment was regarding Aksharamukha's use of the apostrophe to denote anusvara and a colon character in general for visarga (but not specifically U+A789).

jamadagni commented 6 months ago

Namaste. @virtualvinodh and I decided to use ʼ U+02BC MODIFIER LETTER APOSTROPHE and ꞉ U+A789 MODIFIER LETTER COLON for their visual similarity to the existing attested characters and their character category of Lm = Modifier Letter, which does not generate a word break.

If we had used the punctuation apostrophe or colon, it would cause a word break as per the Unicode word-breaking algorithm. This would be a problem especially in the middle of a word.

For example try full-word selection methods like double-clicking or Ctrl+cursor keys on து₃꞉க₂ம் (with modifier colon) vs து₃:க₂ம் (with normal colon). On a compliant implementation (like my Firefox on Kubuntu 23.10) the first gets fully selected whereas the second stops at the colon.

Likewise for ஸம்ʼஸ்க்ருʼதம் (modifier apostrophe) vs ஸம்'ஸ்க்ரு'தம் (normal apostrophe).
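The difference between the look-alike characters can be inspected with a minimal Python sketch. It only prints each character's Unicode name and general category from the standard `unicodedata` module; the helper names `describe` and `is_punctuation` are made up for illustration. The general category is just a rough proxy here: actual word selection follows the Word_Break property of UAX #29, which the Python standard library does not expose.

```python
import unicodedata

# The two modifier characters discussed above and their ASCII look-alikes.
CANDIDATES = {
    "modifier apostrophe": "\u02BC",
    "modifier colon": "\uA789",
    "ASCII apostrophe": "'",
    "ASCII colon": ":",
}

def describe(ch):
    """Return code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"

def is_punctuation(ch):
    """True if the general category is one of the P* (punctuation) classes."""
    return unicodedata.category(ch).startswith("P")

for label, ch in CANDIDATES.items():
    print(f"{label:20} {describe(ch)} punctuation={is_punctuation(ch)}")
```

The ASCII characters are classed as punctuation while the two modifier characters are not, which is consistent with the selection behaviour described above.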

As explained in my Unicode document mentioned above, the Tamil anusvara was a character that should never have been encoded in the first place. Hence it is best to avoid it for all purposes.

I do not believe that there is any attested consistency in printed texts for the anusvara, but I think I have shown attestation for the apostrophe for ருʼ, so applying it to ம் for the anusvara was our own decision I believe.

But the visarga is mostly written in South Indian scripts as two rings, like Kannada ಃ, rather than two dots like Devanagari ः, though some publications in Tamil do use the two-dot form, I guess due to the convenient availability of the colon. We figured that a font which would provide nice typography for Tamil orthography of Samskritam might provide the ringed glyph for A789 (and I use such a modified Lohit Tamil font locally).

IIRC we decided on this encoding system back before Grantha was encoded. And normally intermixing of Indic codepoints is discouraged due to the high similarities between scripts. Hence we used only script-neutral characters to extend the script to be immediately usable with existing technology.

But recently the perception among technically minded people seems to be that it is not possible to entirely avoid mixing in the case of Tamil and Grantha for many reasons, one important reason being that even historically the two scripts have been heavily mixed and if you want to represent such inscriptions etc then you need to mix.

See The Unicode Standard 15.0, section 12.6 (Tamil), bottom of p. 513 to top of p. 514: U+1133C (double-dot nukta) and U+1133B (single-dot nukta) from the Grantha block are recommended for use in Tamil script text for special usage requirements.

Similarly, maybe today one could use GRANTHA SIGN ANUSVARA and GRANTHA SIGN VISARGA. I think we had also contemplated use of the GRANTHA SIGN CANDRABINDU but I do not recall having written this down.
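As a rough illustration of that idea, the sketch below swaps the modifier-character conventions for the Grantha signs. It assumes the input already follows the convention described above (anusvara written as ம் plus U+02BC, visarga as U+A789); the function name `use_grantha_signs` is hypothetical and is not part of vidyut or Aksharamukha. Whether the result renders well depends on font support.

```python
import unicodedata

# Grantha signs looked up by their Unicode character names.
GRANTHA_ANUSVARA = unicodedata.lookup("GRANTHA SIGN ANUSVARA")  # U+11302
GRANTHA_VISARGA = unicodedata.lookup("GRANTHA SIGN VISARGA")    # U+11303

# Conventions assumed for the input text (from the discussion above).
TAMIL_ANUSVARA = "\u0BAE\u0BCD\u02BC"  # ம + pulli + modifier apostrophe
TAMIL_VISARGA = "\uA789"               # modifier colon

def use_grantha_signs(text):
    """Replace the modifier-based anusvara and visarga with Grantha signs."""
    text = text.replace(TAMIL_ANUSVARA, GRANTHA_ANUSVARA)
    return text.replace(TAMIL_VISARGA, GRANTHA_VISARGA)

print(use_grantha_signs("ஸம்ʼஸ்க்ருʼதம்"))
print(use_grantha_signs("து₃꞉க₂ம்"))
```

Only the full ம்ʼ sequence is replaced, so the apostrophe in ருʼ is left alone.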

Hope this helps.

virtualvinodh commented 6 months ago

I concur with Shriramana.

<< I find what Aksharamukha has done a pure hack, with no historical or usage-basis. Perhaps scholars like jamadagni may have other ideas. >>

This is true. But we needed a standard for a non-lossy representation of Sanskrit in the Tamil script. So, Shriramana and I discussed this in detail and we implemented a standard. It's been more than a decade and you will find Sanskrit content in Tamil script using the above convention pretty much widespread on the internet.

I suppose nearly all of them use Aksharamukha. Nevertheless, most Tamil texts with superscript numerals on the internet are likely to use the apostrophe to denote Anusvara and the vocalic vowels. (And the modifier colon for visarga)

In any case, Aksharamukha allows the use of Grantha visarga but it only works with Noto Sans/Serif Tamil.


virtualvinodh commented 6 months ago

As a side note, @jamadagni (correct me if I'm wrong) and I would prefer rendering Sanskrit in Tamil infused with the Grantha repertoire. I am not a big fan of the usage of superscript/subscript numerals.


V

deepestblue commented 6 months ago

@akprasad so there you have it! Both Shriramana and Vinodh opine (and I do, FWIW) that Sanskrit speakers who want to read/write in the Tamil script should use Grantha characters rather than the superscript/subscript notation :-)

jamadagni commented 6 months ago

@virtualvinodh yes I have not changed my preference on this.

@deepestblue Well our opinion may be that people “should” use mixed Tamil+Grantha, but of course we can't force people to.

A project I have been contributing to has been using Tamil+234 as well as Tamil+Grantha for publicly distributed anushthana documents for a few years now. (Thanks to @virtualvinodh for creating the Agastya font for Tamil+Grantha leveraging the existing Malayalam encoding.)

For example see the recent document: https://bit.ly/vdsp-chandra-grahanam

Almost exactly one year ago, when we tried to stop including Tamil+234 documents to “encourage” people to use Tamil+Grantha, we started getting a whole lot of messages asking “where is the Tamil document?” and so we had to reinstate the Tamil+234 documents.

So obviously people don't recognize Tamil+Grantha as “Tamil”. It is admittedly a mixed script.

So we prefixed the Tamil+234 documents with a page explaining the need for Tamil+Grantha and recommending to use it, and left it at that. You can see both kinds of documents in the above link.

Some people told us: “See we are old, don't expect us to learn anything new (however minimal effort may be required). Shouldn't you help us?”

So rather than leaving people in the lurch and making it difficult for them to do some anushthanam (whether with correct pronunciation or not), we just prefixed the Recommendation Notice.

I guess whatever lipi you use, even if Tamil+Grantha or full Grantha, proper pronunciation depends on proper training. The lipi is at best a good sādhanam to help you along the way. Tamil+Grantha is much better than Tamil+234 at being such a sādhanam, that's all. That's why we recommend it. But it is up to the audience to read it correctly.

For example, even if you use IAST to write a verse, do you expect people – whether Bharatiyas or others – to pronounce it correctly just because the diacritics are used? Mostly lay people ignore diacritics, which is exactly the problem in the case of Tamil+234.

Diacritics are at best an academic device to disambiguate in case of doubt. 234 are also just diacritics. So only if someone has a doubt about the அர்த்த₂ (not அர்த்₃த₄) of யதா₂ or யதா₃ they will check the (fineprinted) number. Otherwise they will just do a “யதார்த்த” reading and move on. 🙂
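The point about numerals being ignored can be made concrete with a small Python sketch. It assumes the Tamil+234 numerals are the superscript or subscript digits 2 to 4, and `lay_reading` is a hypothetical helper that simulates a reader who skips them.

```python
# Superscript and subscript 2, 3, 4 (assumed to be the only Tamil+234 numerals).
NUMERALS = "\u00B2\u00B3\u2074\u2082\u2083\u2084"
_STRIP_NUMERALS = str.maketrans("", "", NUMERALS)

def lay_reading(text):
    """Drop the disambiguating numerals, as a reader who ignores them would."""
    return text.translate(_STRIP_NUMERALS)

yatha = "யதா₂"  # the form glossed with ₂ above
yada = "யதா₃"   # the form glossed with ₃ above

print(yatha == yada)                             # distinct as written
print(lay_reading(yatha) == lay_reading(yada))   # identical once numerals are ignored
```

Once the numerals are dropped, both words collapse to the same plain Tamil string.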

So I feel we should certainly encourage people to use Tamil+Grantha, maybe by placing it higher up than Tamil+234 on the menu or such, but we cannot force people by denying the option to use Tamil+234.

Technology may not be an entirely appropriate or efficient tool for orthography reform (though it may help somewhat). If your software doesn't support what people want, they may simply move to another solution. Better to provide them some good solution – even if it isn't what we consider ideal – in addition to the ideal. Once you retain the audience, hopefully slowly you can convince them.

Hope I have not bored you people. Ram Ram.

deepestblue commented 6 months ago

@jamadagni thanks for your wise response to my somewhat flippant post :-) I agree 100% with your philosophy of meeting the users where they are.

akprasad commented 5 months ago

Thank you all for the wonderful discussion. I will work on implementing this now.

akprasad commented 5 months ago

I have a local implementation of Tamil+234. Initial tests indicate that it is equivalent to the implementation in Aksharamukha.

I will close this issue when the code is merged.

akprasad commented 5 months ago

Merged with test cases. A workable version that solves the items in my first message (superscripts and anusvara) is live, but I see clear areas for improvement, particularly in the system's naive treatment of ந within a word (should prefer ன).
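For illustration only, here is one way such a post-processing rule might look in Python. This is not vidyut's actual fix: the function name `prefer_alveolar_na` is hypothetical, and the rule is an assumed simplification of the usual Tamil spelling convention (keep ந at the start of a word and before ்த, use ன elsewhere).

```python
import re

NA = "\u0BA8"     # ந (dental)
NNNA = "\u0BA9"   # ன (alveolar)
PULLI = "\u0BCD"  # ் (virama)
TA = "\u0BA4"     # த

# Characters treated as "inside a word": the Tamil block, superscript and
# subscript 2-4, and the two modifier characters (an assumed set).
WORD_CHARS = "\u0B80-\u0BFF\u00B2\u00B3\u2074\u2082-\u2084\u02BC\uA789"

# ந preceded by a word character and not followed by pulli + த.
_MEDIAL_NA = re.compile("(?<=[" + WORD_CHARS + "])" + NA + "(?!" + PULLI + TA + ")")

def prefer_alveolar_na(text):
    """Rewrite word-internal ந as ன, except in the cluster ந்த."""
    return _MEDIAL_NA.sub(NNNA, text)

print(prefer_alveolar_na("மநஸ்"))  # medial ந becomes ன
print(prefer_alveolar_na("அந்த"))  # ந before ்த is kept
print(prefer_alveolar_na("நம"))    # word-initial ந is kept
```

A real implementation would likely need more cases (compound boundaries, for example), so this is only a starting point for discussion.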

I'll close this issue since the work in the first message is complete, but I'll fix this ந/ன issue as soon as I can.