Ezhil-Language-Foundation / tamilpesu_us

http://tamilpesu.us - Open-Tamil hosted like a BO$$.
GNU Affero General Public License v3.0

A few thousand documents written (1995 to 1999) using TamNet.ttf from the old irdu.nus.sg archives [Advice Needed] #15

Open ashbeats opened 11 months ago

ashbeats commented 11 months ago

Hi,

My name is John. I have been attempting to convert a few thousand documents and articles written in TamNet.ttf, a bilingual font released in 1995. The original authors are no longer around, and I have been looking for information about the keyboard mapping so that I can either write a converter or find an existing one that converts the text to standard Unicode.

Do you know of a converter? I tried your Open-Tamil lib, but it failed to recognise the text or the conversions were not fully accurate.

I understand that the encoding is also questionable, as the documents were moved between various formats (such as ANSI) over the years. I have therefore preserved the originals and have been inspecting them in binary, comparing them to the same texts written in the TamNet99 and Murasu formats.

Based on Google searches and papers, the closest match seems to be TamNet99; however, there may be edge cases that elude me.
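
The binary comparison described above can be partly automated. The following is only a minimal sketch, assuming (as is usual for bilingual 8-bit fonts, but not confirmed for TamNet.ttf) that the Tamil glyphs sit above 0x7F while ASCII stays in the lower half. The function names and sample bytes are hypothetical; real input would come from `open(path, "rb").read()`.

```python
from collections import Counter
from typing import List


def high_byte_histogram(data: bytes) -> Counter:
    """Count every byte value above 0x7F (assumed here to be Tamil glyph slots)."""
    return Counter(b for b in data if b >= 0x80)


def bytes_only_in_first(a: bytes, b: bytes) -> List[int]:
    """Return the high byte values that occur in `a` but never in `b`."""
    return sorted(set(high_byte_histogram(a)) - set(high_byte_histogram(b)))


if __name__ == "__main__":
    # Invented sample data, standing in for the same text saved in two encodings.
    tamnet_sample = bytes([0x41, 0xAB, 0xAB, 0xC5, 0x20, 0xE1])
    tamnet99_sample = bytes([0xAB, 0xC5, 0xC5, 0x20])

    # Most frequent high bytes first: a starting point for guessing the mapping.
    for value, count in high_byte_histogram(tamnet_sample).most_common():
        print(f"0x{value:02X}: {count}")

    # Byte values that would need their own mapping entry or manual inspection.
    print([hex(v) for v in bytes_only_in_first(tamnet_sample, tamnet99_sample)])
```

Byte values that appear in the TamNet files but never in the TamNet99 versions of the same text are the likely edge cases.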

Any insights or direction would be most appreciated.

Best Regards, John

arcturusannamalai commented 11 months ago

Hi there. Interesting topic; I'll post it to my Twitter. Perhaps NUS can announce a bug bounty and have some engineers take a look.

In the past what has helped is the following:

I'm sure someone can crack this problem with sufficient effort and motivation. Thanks

tshrinivasan commented 11 months ago

@ashbeats I can explore this. Please share the font's TTF file and a few sample documents.

gchandra10 commented 11 months ago

This HTML/JS script should help. Please try it.

https://www.suratha.com/reader.htm

ashbeats commented 11 months ago

Hi,

Thank you for responding.

The documents have just been restored to the original website: https://kanian.com

Another archived site holds a bit more information: https://ccat.sas.upenn.edu/plc/tamilweb/

The fonts are available for download here: https://ccat.sas.upenn.edu/plc/tamilweb/download.html

and GPT4 had this to add...

... The TAMNET.ttf font is based on the TAM encoding system, which stands for Tamilnet. It was developed by Mr. Naa Govindasamy, an expert in Tamil encoding, and was released in 1995 by the Institute of Research in Digital Units (IRDU) in Singapore.

TAMNET.ttf is a TrueType font that uses a unique encoding scheme to represent Tamil characters. It deviates from the traditional Tamil encoding systems like TSCII (Tamil Standard Code for Information Interchange) or TAM (Tamil Monolingual Keyboard). Instead, it introduces a new layout that is optimized for ease of use and compatibility with the ASCII character set.

In TAMNET.ttf, the Tamil characters are mapped to the traditional QWERTY keyboard layout, where each key represents one Tamil character. For example, pressing the 'a' key outputs the Tamil character 'அ', 'b' outputs 'ப', 'c' outputs 'ச', and so on. This layout made it convenient for users familiar with the English keyboard layout to type Tamil characters without the need for any additional hardware or input methods.

TAMNET.ttf gained popularity during the late 1990s and early 2000s as it provided an easy-to-use encoding system ...

tshrinivasan commented 11 months ago

Found the character map of the tamilnet.ttf file here: https://fontsdata.com/76760/tamilnet.htm

Exploring how to use that table for Unicode conversion.
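
For anyone picking this up, a table-driven conversion could look roughly like the sketch below. The byte values in `GLYPH_TO_UNICODE` are invented placeholders, not the real TamNet.ttf code points, and would have to be filled in from the font's character map. The sketch also assumes that the font stores the vowel signs ெ, ே and ை before their consonant (visual order), as many legacy Tamil fonts do, whereas Unicode stores them after it; this has not been verified for TamNet.ttf.

```python
import re

# PLACEHOLDER table: these byte values are made up for illustration only.
# Replace them with the real glyph positions from the font's character map.
GLYPH_TO_UNICODE = {
    0xAB: "\u0B85",  # அ
    0xC5: "\u0B95",  # க
    0xC6: "\u0BAE",  # ம
    0xE1: "\u0BBE",  # ா  vowel sign aa, written after the consonant
    0xE2: "\u0BC6",  # ெ  vowel sign e, drawn before the consonant
    0xE3: "\u0BCD",  # ்  pulli
}

PREFIX_SIGNS = "\u0BC6\u0BC7\u0BC8"  # ெ ே ை
CONSONANT = "[\u0B95-\u0BB9]"        # Tamil consonant block, க through ஹ


def convert(data: bytes) -> str:
    """Convert font-encoded bytes to Unicode text using the table above."""
    # Step 1: map byte by byte. ASCII passes through unchanged (assumption:
    # the font is bilingual); unmapped high bytes become U+FFFD so they
    # remain visible for later inspection.
    text = "".join(
        chr(b) if b < 0x80 else GLYPH_TO_UNICODE.get(b, "\uFFFD") for b in data
    )
    # Step 2: visual order to logical order, i.e. move a prefix vowel sign
    # to the position after the consonant it belongs to.
    text = re.sub(f"([{PREFIX_SIGNS}])({CONSONANT})", r"\2\1", text)
    # Step 3: compose two-part vowels: ெ + ா becomes ொ, ே + ா becomes ோ.
    text = text.replace("\u0BC6\u0BBE", "\u0BCA")
    text = text.replace("\u0BC7\u0BBE", "\u0BCB")
    return text


if __name__ == "__main__":
    # With the placeholder table: ெ, க, ா in visual order.
    print(convert(bytes([0xE2, 0xC5, 0xE1])))
```

Emitting U+FFFD for unmapped bytes, rather than dropping them, makes gaps in the table easy to spot in the converted output.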

arcturusannamalai commented 11 months ago

Thanks, folks. @tshrinivasan, if you find a fix, please also post a PR to open-tamil.

tshrinivasan commented 11 months ago

Working with Udhayan (udhayam.in) to get the mapping for this font.

Will update here on the progress soon.

arcturusannamalai commented 7 months ago

@ashbeats - do you still need this feature? Did you make any progress?

ashbeats commented 5 months ago

@arcturusannamalai I do, but the project is on hold.