Ezhil-Language-Foundation / open-tamil

Open Source Tamil NLP Tools - தமிழ் இயற்கை மொழி பகுப்பாய்வு நிரல்தொகுப்பு
http://tamilpesu.us
MIT License

Tamil ISO 15919 to Tamil Unicode conversion #253

Closed Natkeeran closed 1 year ago

Natkeeran commented 1 year ago

The ISO 15919 standard is often used to convert Tamil text to romanized text. This is especially the case in many cataloguing systems. Example: https://catalog.hathitrust.org/Record/6133883. The Tamil public is generally not familiar with this standard. Is there any tool that can take ISO 15919 romanized text and convert it into Tamil Unicode text? (That is, https://www.ushuaia.pl/transliterate/?ln=en in reverse.)

tshrinivasan commented 1 year ago

Please provide a mapping table for ISO 15919 romanized text to Tamil unicode text.

See the existing mapping tables here: https://github.com/Ezhil-Language-Foundation/open-tamil/blob/main/tamil/txt2unicode/encode2utf8.py

If we get the mapping, we can easily add it to open-tamil.

Natkeeran commented 1 year ago

@tshrinivasan

There is already a mapping from Tamil to ISO 15919 (https://en.wikipedia.org/wiki/ISO_15919) in Open Tamil (https://github.com/Ezhil-Language-Foundation/open-tamil/issues/233). What is needed here is the reverse: ISO 15919 -> Tamil Unicode.

This converter comes close: http://aksharamukha.appspot.com/converter.
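As a rough illustration of what reversing an existing Tamil -> ISO 15919 mapping involves, here is a minimal sketch in plain Python. The table `TAMIL_TO_ISO` and the helper `invert_table` are hypothetical names made up for this example; the table is a tiny hand-written fragment, not the one shipped with Open Tamil. The sketch only shows that the reverse table can be derived mechanically, provided no two Tamil entries share the same romanization.

```python
import unicodedata

# Hypothetical, deliberately tiny Tamil -> ISO 15919 fragment (illustration only;
# this is NOT the table shipped with Open Tamil).
TAMIL_TO_ISO = {
    "அ": "a",
    "ஆ": "ā",
    "க": "ka",
    "கா": "kā",
    "க்": "k",
    "ம்": "m",
}


def invert_table(table):
    """Swap keys and values, refusing to silently drop ambiguous entries."""
    reverse = {}
    for tamil, roman in table.items():
        # Normalize so that precomposed and decomposed diacritics compare equal.
        roman = unicodedata.normalize("NFC", roman)
        if roman in reverse:
            raise ValueError(
                f"{roman!r} maps to both {reverse[roman]!r} and {tamil!r}"
            )
        reverse[roman] = tamil
    return reverse


ISO_TO_TAMIL = invert_table(TAMIL_TO_ISO)
```

Raising on duplicates is a design choice for this sketch: a romanization that maps back to two different Tamil strings cannot be reversed by table lookup alone, so it seems better to surface that early.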

arcturusannamalai commented 1 year ago

Simple solution:

Given that the following code transliterates Tamil into romanized (ISO 15919) text,

from transliterate import azhagi, jaffna, combinational, UOM, ISO, itrans, algorithm

# Table used here to map Tamil text to ISO 15919 romanized text
ISO_table = ISO.ReverseTransliteration.table
expected = 'cāmi. citamparaṉār nūṟ kaḷañciyam'
tamil_str = "சாமி. சிதம்பரனார் நூற் களஞ்சியம்"
eng_str = algorithm.Direct.transliterate(ISO_table, tamil_str)
print(eng_str)  # compare with `expected`

the code below can be used to reverse the transliteration:

from transliterate import algorithm as tx_algo
rev_table = tx_algo.reverse_transliteration_table(ISO_table)
new_tamil_str0 = algorithm.Direct.transliterate(rev_table,eng_str)
print(new_tamil_str0)

However, the direct lookup alone is not sufficient, so we use the iterative algorithm instead:

new_tamil_str1 = algorithm.Iterative.transliterate(rev_table,eng_str)
print(new_tamil_str1)
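To make the idea of reverse transliteration concrete without depending on Open-Tamil, here is a self-contained sketch. It is not the library's `Direct` or `Iterative` algorithm. It is a simple greedy longest-match scanner written for illustration, and the names `CONSONANTS`, `VOWELS`, and `iso_to_tamil` are hypothetical. The tables are deliberately incomplete and cover only the letters needed for the example string. One plausible reason a plain one-to-one lookup falls short is that a romanized consonant and the vowel after it must be recombined into a single Tamil syllable, and a consonant with no following vowel needs a pulli.

```python
import unicodedata

PULLI = "\u0bcd"  # Tamil virama: marks a consonant with no following vowel


def _nfc(s):
    return unicodedata.normalize("NFC", s)


# Hypothetical, incomplete tables (illustration only).
CONSONANTS = {_nfc(k): v for k, v in {
    "k": "க", "c": "ச", "ñ": "ஞ", "t": "த", "n": "ந", "p": "ப",
    "m": "ம", "y": "ய", "r": "ர", "ḷ": "ள", "ṟ": "ற", "ṉ": "ன",
}.items()}

# romanized vowel -> (independent letter, dependent vowel sign)
VOWELS = {_nfc(k): v for k, v in {
    "a": ("அ", ""), "ā": ("ஆ", "ா"), "i": ("இ", "ி"), "ī": ("ஈ", "ீ"),
    "u": ("உ", "ு"), "ū": ("ஊ", "ூ"), "ai": ("ஐ", "ை"),
}.items()}


def _match(table, text, pos):
    """Return the longest key of `table` starting at text[pos], or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if len(key) == length and key in table:
            return key
    return None


def iso_to_tamil(text):
    """Greedy longest-match conversion of ISO 15919 text to Tamil script."""
    text = _nfc(text)
    out = []
    pos = 0
    while pos < len(text):
        cons = _match(CONSONANTS, text, pos)
        if cons:
            pos += len(cons)
            out.append(CONSONANTS[cons])
            vowel = _match(VOWELS, text, pos)
            if vowel:
                # Consonant + vowel: attach the dependent vowel sign.
                out.append(VOWELS[vowel][1])
                pos += len(vowel)
            else:
                # Bare consonant: add the pulli.
                out.append(PULLI)
            continue
        vowel = _match(VOWELS, text, pos)
        if vowel:
            # Vowel not preceded by a consonant: use the independent letter.
            out.append(VOWELS[vowel][0])
            pos += len(vowel)
            continue
        # Anything else (spaces, punctuation, unknown letters) passes through.
        out.append(text[pos])
        pos += 1
    return "".join(out)


print(iso_to_tamil("cāmi. citamparaṉār nūṟ kaḷañciyam"))
```

With these tables, the call above reproduces the Tamil string from the earlier example, சாமி. சிதம்பரனார் நூற் களஞ்சியம். Letters missing from the tables are passed through unchanged, so a full converter would need the complete ISO 15919 inventory.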

arcturusannamalai commented 1 year ago

As executed on Colab with Open-Tamil v1.1:

[Screenshot of the Colab session, captured 2022-09-09 at 11:59:40 PM]