cburgmer / cjklib

Han character library for CJKV languages

Get Yale readings #15

Open LawranceFung opened 3 years ago

LawranceFung commented 3 years ago

Since Yale encodes the difference between the high level and high falling tones but Jyutping doesn't, would it be possible to get the Yale readings directly?

cburgmer commented 3 years ago

Hey, I am not maintaining this repo anymore, and it's been a while, so my memory may be partly wrong.

This library relies heavily on the Unicode Unihan database, and they do not include Yale readings AFAIK: https://www.unicode.org/reports/tr38/index.html#kCantonese.

LawranceFung commented 3 years ago

That's fine. This Python library is functional enough for my needs, so I'm fine with no more updates.
I'm getting different results between running a Python script from Windows cmd and pasting the same Python into the interactive shell. Specifically, Yale romanizations with the high level tone seem to fetch the characters with the high falling tone when run from a *.py file, but return the correct and distinct results when entered in the interactive Python shell. On a related note, when a romanization in Jyutping with tone number 1 (corresponding to both the high level and high falling tones) is queried through a Python script, it only returns the result for the high falling tone, when the list should contain both high falling and high level. Any idea what's up with that? Could it have something to do with differing environment variables?
Also, if the Unihan database doesn't have Yale readings, how did you programmatically distinguish between high falling and high level?

cburgmer commented 3 years ago

> On a related note, when a romanization in Jyutping with tone number 1 (corresponding to both the high level and high falling tones) is queried through a Python script, it only returns the result for the high falling tone, when the list should contain both high falling and high level. Any idea what's up with that?

Sorry, I don't think I can answer this specific question without reading the code more. Feel free to dig in and ask about specific areas in the code if you get stuck, though!

> how did you programmatically distinguish between high falling and high level?

I'm not sure this helps, but the tone logic should basically boil down to this code:

    DEFAULT_TONE_MAPPING = {1: '1stToneLevel', 2: '2ndTone', 3: '3rdTone',
        4: '4thTone', 5: '5thTone', 6: '6thTone'}

So maybe the case you are asking about is not covered? From a Jyutping perspective that might be correct, since that system chose not to represent the distinction. From a Yale perspective, however, it would be wrong.
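To make the gap concrete, here is a minimal sketch of what a Yale-aware lookup would need on top of that mapping. It is not cjklib's actual code: the name `'1stToneFalling'`, the helper `candidate_yale_tones`, and the stop-final rule (taken from the observation reported later in this thread) are all assumptions for illustration.

```python
# Mapping quoted above: Jyutping tone number -> a single tone name.
DEFAULT_TONE_MAPPING = {1: '1stToneLevel', 2: '2ndTone', 3: '3rdTone',
                        4: '4thTone', 5: '5thTone', 6: '6thTone'}

# Assumption: syllables ending in a stop consonant only take the high level
# tone, never high falling (as observed later in this thread).
STOP_FINALS = ('p', 't', 'k')


def candidate_yale_tones(plain_syllable, jyutping_tone):
    """Return every Yale tone name that a Jyutping tone number could stand for.

    Hypothetical helper: '1stToneFalling' is an assumed name, not a confirmed
    cjklib identifier.
    """
    if jyutping_tone == 1:
        if plain_syllable.endswith(STOP_FINALS):
            return ['1stToneLevel']
        # Jyutping tone 1 does not say which of the two high tones is meant.
        return ['1stToneLevel', '1stToneFalling']
    return [DEFAULT_TONE_MAPPING[jyutping_tone]]


print(candidate_yale_tones('tim', 1))
print(candidate_yale_tones('sik', 1))
```

With a one-to-one mapping like `DEFAULT_TONE_MAPPING`, a query for Jyutping tone 1 can only ever resolve to one of the two Yale tones, which matches the behaviour described in the question.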

LawranceFung commented 3 years ago

    # -*- coding: UTF-8 -*-
    # cjklib is only compatible with Python 2; call it with
    #     py -2 Query_7_per_cjklib.py
    # (or at least that's what I thought I was supposed to do, until the command
    # line gave me the correct results when I entered the Python directly and
    # calling the script didn't)
    # cjklib is the only thing I could find on GitHub that claimed to correctly
    # handle Yale's high falling/high level distinction
    import sys
    from cjklib import characterlookup
    print sys.version_info
    # set locale as traditional
    cjk = characterlookup.CharacterLookup('T')

    f = open('cjklib_seven.txt', 'w')
    sys.stdout = open('C:/Users/Public/output.txt', 'w')
    print(u'tìm'.encode('UTF-8'))
    print(cjk.getCharactersForReading('tìm', 'CantoneseYale'))
    print(u'tīm'.encode('UTF-8'))
    print(cjk.getCharactersForReading('tīm', 'CantoneseYale'))

This is the script that gives me different results in interactive mode and when run as a .py file. I have both Python 2.7 and Python 3.8 installed.

LawranceFung commented 3 years ago

I think I figured out the issue: whatever encoding gets passed to the interactive Python process doesn't support precomposed characters with a macron, so they were being treated as unaccented characters (tone 3) instead. Many characters can be pronounced with either the high level or the high falling tone, sometimes every character for a particular reading, which didn't help when I tried comparing query results to see whether cjklib was processing the high level and high falling tones differently. So, not a bug in cjklib AFAIK.

Where does cjklib get the dictionary data to distinguish the high level and high falling tones in Yale? I checked cedict, cedictgr, handedict, cfdict, unihan, and kanjidic2, and none of them show it.

Actually, I just checked all the possible Yale syllables, and it seems cjklib only distinguishes high falling from high level in recognizing that only high level can occur when the syllable ends with p, t, or k.
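The encoding explanation above can be checked without cjklib. The following is a minimal Python 3 sketch, not a reproduction of the original Python 2 session: the code pages `cp1252` and `cp437` are assumptions about what a Windows console might use, and the helpers `can_encode` and `strip_tone_marks` are hypothetical names. It shows that the grave-accented `ì` exists in those legacy code pages while the macron `ī` does not, and what a syllable looks like once its tone mark is lost.

```python
import unicodedata


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def strip_tone_marks(text):
    """Decompose to NFD and drop combining marks, leaving the bare letters."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# High falling (grave) and high level (macron) versions of the same syllable.
for syllable in ('tìm', 'tīm'):
    print(ascii(syllable),
          'cp1252:', can_encode(syllable, 'cp1252'),
          'cp437:', can_encode(syllable, 'cp437'),
          'without mark:', strip_tone_marks(syllable))
```

If a console falls back to the bare letter for a character it cannot represent, `tīm` would reach the library as `tim`, which Yale reads as the unmarked mid level tone. Whether that is exactly what happened in the session above is an assumption.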