alpha-rudy / taiwan-topo

Taiwan Hiking Maps

Display names in Pinyin #60

Closed hegezolee closed 4 years ago

hegezolee commented 4 years ago

Hi Rudy,

Is there a way to compile this so that places are displayed in pinyin and not traditional Chinese? My Fenix 6, bought in Europe, can't display Chinese characters, and the generated gmapsupp shows only question marks.

Thanks in advance.

alpha-rudy commented 4 years ago

Hi, @hegezolee , after some research, I found it can be done.

It would need Python's osmium (pyosmium) and pinyin-related modules to convert a pbf from Traditional Chinese to Pinyin.

I'm wondering if you could help and do the work? :)

alpha-rudy commented 4 years ago

BTW, I'm also looking for help in our facebook group. XD

bdon commented 4 years ago

MOE moedict may have more Taiwan place names, but CC-CEDICT may also be an option. The licenses are slightly different but one or both may be OK. I think the python pinyin module uses a hardcoded dictionary that does not have many place names. An approach may be to compile moedict to a lookup-friendly format that can resolve a name: tag and potential 破音字 using longest substring match. I have been working on this approach here https://github.com/bdon/hanzi2reading

I may look into building a python module for this that could be integrated seamlessly with pyosmium. A benefit of this design could be that it also supports other systems such as Bopomofo, Gwoyeu Romatzyh, etc.

gromsy commented 4 years ago

But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin.

Has something changed with recent versions?

hegezolee commented 4 years ago

> But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin.
>
> Has something changed with recent versions?

You're probably using an APAC firmware that can display Chinese characters. I got mine in Europe and after compiling the bw version of the map, all names are just question marks.

alpha-rudy commented 4 years ago

> MOE moedict may have more Taiwan place names, but CC-CEDICT may also be an option. The licenses are slightly different but one or both may be OK. I think the python pinyin module uses a hardcoded dictionary that does not have many place names. An approach may be to compile moedict to a lookup-friendly format that can resolve a name: tag and potential 破音字 using longest substring match. I have been working on this approach here https://github.com/bdon/hanzi2reading
>
> I may look into building a python module for this that could be integrated seamlessly with pyosmium. A benefit of this design could be that it also supports other systems such as Bopomofo, Gwoyeu Romatzyh, etc.

tag @poi890poi

bdon commented 4 years ago

Had the chance to work on this a bit today. You can do pip install hanzi2reading (now version 0.0.3), which gives you a command-line utility that takes a character string as input:

hanzi2reading 行銀行
> xíngyínháng

This is based on https://github.com/g0v/moedict-data so the database license there would apply.

You can also use this directly as a Python library. Here is a pyosmium script that converts an osm.pbf to a new one, adding an aux:pinyin tag to every object that has a name tag. You can grab a small, minutely updated area pbf from https://protomaps.com/extracts

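import sys

import osmium
from hanzi2reading import Reading

# Usage: python <this script> input.osm.pbf output.osm.pbf
# Note: depending on the pyosmium version, the output file may need to not exist yet.

reading = Reading()

def annotate(obj):
    # Return a modified copy carrying aux:pinyin when the object has a name;
    # objects without a name are passed through unchanged.
    if 'name' in obj.tags:
        d = dict(obj.tags)
        d['aux:pinyin'] = reading.get(d['name'])
        return obj.replace(tags=d)
    return obj

class Handler(osmium.SimpleHandler):
    # Streams every node, way and relation from the input into the writer.
    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def node(self, n):
        self.writer.add_node(annotate(n))

    def way(self, w):
        self.writer.add_way(annotate(w))

    def relation(self, r):
        self.writer.add_relation(annotate(r))

writer = osmium.SimpleWriter(sys.argv[2])
Handler(writer).apply_file(sys.argv[1])
writer.close()  # flush buffered objects to the output file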

My implementation is very primitive and probably buggy; it only does a greedy prefix search over the moedict entries. I would be interested to know if you could QA the output and compare it to other pinyin libraries. Also note that proper pinyin should have word segmentation and capitalized proper nouns, so táiběishì should be Táiběi shì, but I don't know how important that is for this application (I may consider adding this to hanzi2reading, but it depends on having reliable data).
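For readers unfamiliar with the term, a greedy prefix search can be sketched as follows. This is a minimal illustration under stated assumptions, not the hanzi2reading implementation: the `TOY_DICT` entries and the `greedy_reading` function are hypothetical, and a real dictionary such as moedict is far larger.

```python
# Toy dictionary for illustration only; not taken from moedict.
TOY_DICT = {
    "行": "xíng",
    "銀": "yín",
    "銀行": "yínháng",
}

def greedy_reading(text, dictionary=TOY_DICT):
    """Romanize text by repeatedly taking the longest dictionary entry
    that matches at the current position (greedy prefix search)."""
    max_len = max(len(key) for key in dictionary)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                out.append(dictionary[chunk])
                i += length
                break
        else:
            # No entry covers this character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(greedy_reading("行銀行"))  # xíngyínháng
```

Matching the two-character entry 銀行 before falling back to single characters is what lets the second 行 be read as háng rather than xíng.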

alpha-rudy commented 4 years ago

Wow, @bdon , superb, thanks!

Let me try to integrate it into the build process. ^^

alpha-rudy commented 4 years ago

Hi, @bdon , I ran a few tests and it works great. ^^

>>> from hanzi2reading import Reading
>>> reading = Reading()
>>> reading.get('來來')
'láilai'
>>> reading.get('台北車站')
'táiběichēzhàn'
>>> reading.get('玉山')
'yùshān'
>>> reading.get('無雙社')
'wúshuāngshè'
>>> reading.get('白冷圳')
'báilěngzùn'

Garmin has a protection policy: devices reject Unicode maps that are not locked. So when calling mkgmap, I can only use code page 1252 (English) or code page 950 (Traditional Chinese).

Code page 1252: https://en.wikipedia.org/wiki/Windows-1252
Code page 950: https://en.wikipedia.org/wiki/Code_page_950

In case I need to convert to ASCII, could hanzi2reading output characters without tone marks (e.g. 'taibeichezhan')?

Thanks!

bdon commented 4 years ago

Is this a mkgmap issue or a Garmin region lock issue? I think mkgmap can take a --unicode option like I did here: https://github.com/hotosm/osm-export-tool-python/commit/d36058626d09f37bb93b0ab53de7e455d3859f57#diff-811486ea533a433e85944e7e640f84a8

Otherwise I would recommend using the text-unidecode python library to normalize to ascii: https://pypi.org/project/text-unidecode/

>>> from text_unidecode import unidecode
>>> unidecode(reading.get("玉山"))
'yushan'

or we can try using https://github.com/mozillazg/python-pinyin which also supports number style pinyin yu4shan1. I want to add this to hanzi2reading but it will take me some more work as it requires parsing the pinyin or zhuyin data. Worth comparing the output on this dataset between hanzi2reading and py-pinyin for quality checks anyway.
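If adding a dependency is not desirable, tone marks can also be stripped with the standard library alone. The sketch below is an assumption-laden alternative to text-unidecode, not something used in this thread: `strip_tone_marks` and `fits_codepage` are hypothetical helper names. It also shows why tone-marked pinyin does not fit code page 1252 (letters such as ě and ē are not in that code page).

```python
import unicodedata

def strip_tone_marks(pinyin):
    """Decompose each letter (NFD) and drop all combining marks.
    Caveat: this also drops the diaeresis, so 'lǜ' becomes 'lu'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def fits_codepage(text, codepage="cp1252"):
    """Return True if every character of text can be encoded in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

name = "táiběichēzhàn"
print(fits_codepage(name))                    # False
print(strip_tone_marks(name))                 # taibeichezhan
print(fits_codepage(strip_tone_marks(name)))  # True
```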

alpha-rudy commented 4 years ago

Thanks, @bdon , I will try both ways (--unicode and unidecode). ^^b

gromsy commented 4 years ago

>> But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin. Has something changed with recent versions?
>
> You're probably using an APAC firmware that can display Chinese characters. I got mine in Europe and after compiling the bw version of the map, all names are just question marks.

No I'm not using APAC firmware. My version is US/North America and cannot display Chinese characters. If it could display Chinese characters, it wouldn't be showing pinyin.

Not sure what you mean by "compiling" the map. Have you tried: open Basecamp -> install Rudy's map -> connect your watch -> send the same map to your Fenix?

hegezolee commented 4 years ago

>>> But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin. Has something changed with recent versions?
>>
>> You're probably using an APAC firmware that can display Chinese characters. I got mine in Europe and after compiling the bw version of the map, all names are just question marks.
>
> No I'm not using APAC firmware. My version is US/North America and cannot display Chinese characters. If it could display Chinese characters, it wouldn't be showing pinyin.
>
> Not sure what you mean by "compiling" the map. Have you tried: open Basecamp -> install Rudy's map -> connect your watch -> send the same map to your Fenix?

Sorry, by compiling I meant to create the gmapsupp.img using the source from here. Interestingly, the maps downloaded from http://garmin.openstreetmap.nl/ show pinyin but they don't contain any topo layers.

alpha-rudy commented 4 years ago

> Sorry, by compiling I meant to create the gmapsupp.img using the source from here. Interestingly, the maps downloaded from http://garmin.openstreetmap.nl/ show pinyin but they don't contain any topo layers.

Yes, @hegezolee , we understand. ^^

alpha-rudy commented 4 years ago

Hi, @poi890poi , with all the information provided by bdon, I will proceed with this task. ^^

poi890poi commented 4 years ago

@alpha-rudy Got it. I am free today if you need any help.

bdon commented 4 years ago

FYI, I added a feature for tone-less pinyin. The version 0.1.1 API is more advanced; use it like this:

from hanzi2reading.reading import Reading
from hanzi2reading.pinyin import get as pinyin
from hanzi2reading.zhuyin import get as zhuyin

reading = Reading()
syllables = reading.get("我們一起去爬山吧")
print(' '.join(pinyin(s) for s in syllables))
print(' '.join(pinyin(s,tones=False) for s in syllables))
print(' '.join(zhuyin(s) for s in syllables))

Output:

wǒ men yī qǐ qù pá shān bā
wo men yi qi qu pa shan ba
ㄨㄛˇ ˙ㄇㄣ ㄧ ㄑㄧˇ ㄑㄩˋ ㄆㄚˊ ㄕㄢ ㄅㄚ

(last tone is wrong because no part-of-speech tagger, which is too complicated for this library)
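Number-style pinyin (yu4shan1), mentioned earlier in the thread, can be derived from tone-marked syllables with the standard library. The following is only a sketch and not part of hanzi2reading: `to_numbered` is a hypothetical helper, it expects one syllable at a time, and writing the neutral tone as 5 is an assumed convention.

```python
import unicodedata

# Combining marks used for the four Mandarin tones.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave
}

def to_numbered(syllable, neutral="5"):
    """Convert one tone-marked pinyin syllable to number style."""
    tone = neutral
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # remember the tone, drop the mark
        else:
            kept.append(ch)        # keeps letters and the diaeresis of ü
    return unicodedata.normalize("NFC", "".join(kept)) + tone

print("".join(to_numbered(s) for s in ["yù", "shān"]))  # yu4shan1
```

Because the conversion works per syllable, it needs segmented input such as the syllable list returned by the 0.1.1 API shown above.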

alpha-rudy commented 4 years ago

Got it. Thank you, @bdon . I'm integrating your code into the build process. ^^

bdon commented 4 years ago

Great, let me know if you need any more info or have suggestions, I am giving a presentation on this topic in August https://coscup.org/2020/en/agenda/JHFNBM so I can add to it ^^

My next task is to add an option to use the CC-CEDICT dictionary, in case the moedict license causes problems (it is CC-BY NoDerivatives 3.0).

alpha-rudy commented 4 years ago

> Great, let me know if you need any more info or have suggestions, I am giving a presentation on this topic in August https://coscup.org/2020/en/agenda/JHFNBM so I can add to it ^^

Wow, looks great! I will attend! ^^

alpha-rudy commented 4 years ago

Hi, @bdon , @poi890poi , @hegezolee , @gromsy , @kcwu ,

Thanks to @bdon , I have made a daily beta of the English encoded Garmin Maps.

To download these maps:

Garmin Handheld: https://map.happyman.idv.tw/rudy/drops/gmapsupp_Taiwan_moi_en_bw.img.zip
Windows BaseCamp: https://map.happyman.idv.tw/rudy/drops/Install_MOI_Taiwan_TOPO_camp3D_en.exe
macOS BaseCamp: https://map.happyman.idv.tw/rudy/drops/Taiwan_moi_en_camp3D.gmap.zip

I chose the English name using the following priority order:

  1. name:en
  2. name:zh_pinyin
  3. hanzi2reading.reading(name)
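The priority order above can be sketched as a simple fallback chain. This is an illustration only, not the actual build code: `pick_english_name` is a hypothetical function, and `romanize` stands in for whatever hanzi2reading call the build uses.

```python
def pick_english_name(tags, romanize):
    """Return the display name for the English map.

    tags     -- dict of OSM tags for one object
    romanize -- callable turning a Chinese name into pinyin (placeholder)
    """
    # 1 and 2: prefer tags that mappers entered by hand.
    for key in ("name:en", "name:zh_pinyin"):
        value = tags.get(key)
        if value:
            return value
    # 3: fall back to machine romanization of the default name.
    name = tags.get("name")
    return romanize(name) if name else None

# Usage with a stand-in romanizer:
fake_romanize = lambda s: "<pinyin of %s>" % s
print(pick_english_name({"name": "玉山", "name:en": "Yushan"}, fake_romanize))
print(pick_english_name({"name": "玉山"}, fake_romanize))
```

Putting the hand-entered tags first means the dictionary-based romanization is used only where no mapper has supplied a name.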

The result looks good. Please help me check for problems, and tell me if any improvement is needed.

Thanks, Rudy 7/3

hegezolee commented 4 years ago

Hi @alpha-rudy,

It's perfect. Thank you so much for your hard work.

alpha-rudy commented 4 years ago

OK. Thanks! ^^

I will release it to the FB group for more testing. If there are no big issues, I will include it in the weekly release. ^^