Closed hegezolee closed 4 years ago
Hi, @hegezolee, after some research, I found that it can be done.
It would need Python's osmium and pinyin-related modules to convert a pbf in Traditional Chinese to Pinyin.
I'm wondering if you could help and do the work? :)
BTW, I'm also looking for help in our facebook group. XD
MOE moedict may have more Taiwan place names, but CC-CEDICT may also be an option. The licenses are slightly different, but one or both may be OK. I think the python pinyin module uses a hardcoded dictionary that does not have many place names. One approach may be to compile moedict into a lookup-friendly format that can resolve a name: tag, including potential 破音字 (characters with more than one reading), using longest substring match. I have been working on this approach here: https://github.com/bdon/hanzi2reading
I may look into building a python module for this that could be integrated seamlessly with pyosmium. A benefit of this design is that it could also support other systems such as Bopomofo, Gwoyeu Romatzyh, etc.
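The longest-match idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `READINGS` table is a hypothetical three-entry stand-in for a compiled moedict or CC-CEDICT lookup, and `longest_match_reading` is an illustrative name, not part of hanzi2reading.

```python
# Hypothetical miniature dictionary; real data would be compiled from
# moedict or CC-CEDICT. 行 alone reads "xíng" but reads "háng" in 銀行.
READINGS = {
    "行": "xíng",
    "銀": "yín",
    "銀行": "yínháng",
}


def longest_match_reading(text, readings=READINGS):
    """Greedy left-to-right lookup that always prefers the longest entry."""
    max_len = max(len(key) for key in readings)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in readings:
                out.append(readings[chunk])
                i += length
                break
        else:
            # No entry matched: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(longest_match_reading("行銀行"))
```

Because the two-character entry 銀行 is tried before the single character 行, the second 行 gets the "háng" reading while the first keeps "xíng". A greedy match can still pick the wrong segmentation on some inputs, so the output would need checking against real place names.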
But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin.
Has something changed with recent versions?
You're probably using an APAC firmware that can display Chinese characters. I got mine in Europe and after compiling the bw version of the map, all names are just question marks.
MOE moedict may have more Taiwan place names, but CC-CEDICT may also be an option. The licenses are slightly different but one or both may be OK. I think the python pinyin module uses a hardcoded dictionary that does not have many place names. An approach may be to compile moedict to a lookup-friendly format that can resolve a name: tag and potential 破音字 using longest substring match. I have been working on this approach here https://github.com/bdon/hanzi2reading
I may look into building a python module for this that could be integrated seamlessly with pyosmium. A benefit of this design could be that it also supports other systems such as Bopomofo, Gwoyeu Romatzyh, etc.
tag @poi890poi
Had the chance to work on this a bit today. You can do pip install hanzi2reading (now version 0.0.3), which will give you a command line utility that takes a character string as input:
hanzi2reading 行銀行
> xíngyínháng
This is based on https://github.com/g0v/moedict-data so the database license there would apply.
You can also use this directly as a python library. Here is a pyosmium script that will convert an osm.pbf to a new one, adding the aux:pinyin tag for every name tag. You can grab a small, minutely updated area pbf from https://protomaps.com/extracts
import sys

import osmium
from hanzi2reading import Reading

reading = Reading()


def annotate(obj):
    # Return a modified copy with aux:pinyin added when the object has a name.
    if 'name' in obj.tags:
        new_obj = obj.replace()
        d = dict(obj.tags)
        d['aux:pinyin'] = reading.get(d['name'])
        new_obj.tags = d
        return new_obj
    else:
        return obj


class Handler(osmium.SimpleHandler):
    def __init__(self, writer):
        super(Handler, self).__init__()
        self.writer = writer

    def node(self, n):
        self.writer.add_node(annotate(n))

    def way(self, w):
        self.writer.add_way(annotate(w))

    def relation(self, r):
        self.writer.add_relation(annotate(r))


# Usage: python <this script> input.osm.pbf output.osm.pbf
writer = osmium.SimpleWriter(sys.argv[2])
Handler(writer).apply_file(sys.argv[1])
writer.close()
My implementation is very primitive and probably buggy; it only does a greedy prefix search over the moedict entries. I would be interested to know whether you could QA the output and compare it to other pinyin libraries. Also note that proper pinyin should have segmentation and capitalized proper nouns, so táiběishì should be Táiběi shì, but I don't know how important that is for this application. (I may consider adding this to hanzi2reading, but it depends on having reliable data.)
Wow, @bdon , superb, thanks!
Let me try to integrate it into the build process. ^^
Hi, @bdon, I ran a few tests and it works great. ^^
>>> from hanzi2reading import Reading
>>> reading = Reading()
>>> reading.get('來來')
'láilai'
>>> reading.get('台北車站')
'táiběichēzhàn'
>>> reading.get('玉山')
'yùshān'
>>> reading.get('無雙社')
'wúshuāngshè'
>>> reading.get('白冷圳')
'báilěngzùn'
Garmin has a protection policy: it blocks Unicode maps that do not carry a protection lock. So when calling mkgmap, I can only use code page 1252 for English or code page 950 for Traditional Chinese.
Code page 1252: https://en.wikipedia.org/wiki/Windows-1252 Code page 950: https://en.wikipedia.org/wiki/Code_page_950
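To see why tone-marked pinyin is a problem under these code pages, one can check whether a label survives encoding. The sketch below uses only Python's built-in codecs; `can_encode` is an illustrative helper name, not part of mkgmap or hanzi2reading.

```python
def can_encode(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


# Compare the same place name in three forms against both code pages.
for label in ("玉山", "yùshān", "yushan"):
    print(label, can_encode(label, "cp1252"), can_encode(label, "cp950"))
```

Code page 1252 covers some accented vowels such as ù, but not the macron and caron vowels used for first and third tones (ā, ě), which is why a plain ASCII form is the safe choice for the English map.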
In case I need to convert to ASCII, could hanzi2reading output characters without tone marks (e.g. 'taibeichezhan')?
Thanks!
Is this a mkgmap issue or a Garmin region lock issue? I think mkgmap can take a --unicode option, like I did here: https://github.com/hotosm/osm-export-tool-python/commit/d36058626d09f37bb93b0ab53de7e455d3859f57#diff-811486ea533a433e85944e7e640f84a8
Otherwise I would recommend using the text-unidecode python library to normalize to ascii: https://pypi.org/project/text-unidecode/
>>> from text_unidecode import unidecode
>>> unidecode(reading.get("玉山"))
'yushan'
Or we can try using https://github.com/mozillazg/python-pinyin, which also supports number-style pinyin (yu4shan1). I want to add this to hanzi2reading, but it will take me some more work as it requires parsing the pinyin or zhuyin data. It is worth comparing the output on this dataset between hanzi2reading and py-pinyin for quality checks anyway.
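If an extra dependency is not wanted, tone marks can also be removed with the standard library alone. This is a minimal sketch, not the text-unidecode implementation; `strip_tones` is an illustrative name.

```python
import unicodedata


def strip_tones(pinyin: str) -> str:
    """Drop combining diacritics so tone-marked pinyin becomes plain letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tones("táiběichēzhàn"))
```

One caveat: this also removes the diaeresis from ü, so lǚ becomes lu. If that distinction matters, ü would need to be mapped separately before stripping.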
Thanks, @bdon, I will try both ways (--unicode and unidecode). ^^b
But why did/do previous versions of Rudy's map already display correctly in pinyin in the past? I have been using a Garmin Fenix 5X with Rudy's map installed for a couple of years now and all place names are always displayed correctly with pinyin. Has something changed with recent versions?
You're probably using an APAC firmware that can display Chinese characters. I got mine in Europe and after compiling the bw version of the map, all names are just question marks.
No I'm not using APAC firmware. My version is US/North America and cannot display Chinese characters. If it could display Chinese characters, it wouldn't be showing pinyin.
Not sure what you mean by "compiling" the map. Have you tried: open Basecamp -> install Rudy's map -> connect your watch -> send the same map to your Fenix?
Sorry, by compiling I meant to create the gmapsupp.img using the source from here. Interestingly, the maps downloaded from http://garmin.openstreetmap.nl/ show pinyin but they don't contain any topo layers.
Yes, @hegezolee , we understand. ^^
Hi, @poi890poi, with all the information provided by bdon, I will proceed with this task. ^^
@alpha-rudy Got it. I am free today if you need any help.
FYI, I added a feature for tone-less pinyin. The version 0.1.1 API is more advanced; use it like this:
from hanzi2reading.reading import Reading
from hanzi2reading.pinyin import get as pinyin
from hanzi2reading.zhuyin import get as zhuyin
reading = Reading()
syllables = reading.get("我們一起去爬山吧")
print(' '.join(pinyin(s) for s in syllables))
print(' '.join(pinyin(s,tones=False) for s in syllables))
print(' '.join(zhuyin(s) for s in syllables))
wǒ men yī qǐ qù pá shān bā
wo men yi qi qu pa shan ba
ㄨㄛˇ ˙ㄇㄣ ㄧ ㄑㄧˇ ㄑㄩˋ ㄆㄚˊ ㄕㄢ ㄅㄚ
(last tone is wrong because no part-of-speech tagger, which is too complicated for this library)
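Given per-syllable pinyin like the output above, number-style pinyin (such as yu4shan1) could be derived by reading the tone from each syllable's diacritic. The following is a standard-library sketch and not a hanzi2reading feature; `to_numbered` and `TONE_MARKS` are illustrative names, and leaving neutral-tone syllables without a digit is an assumption (some conventions write 5 or 0).

```python
import unicodedata

# Combining diacritics for the four pinyin tones:
# macron, acute, caron, grave.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def to_numbered(syllable: str) -> str:
    """Convert one tone-marked syllable, e.g. 'shān', to number style."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so that ü (u plus combining diaeresis) stays one character.
    return unicodedata.normalize("NFC", "".join(letters)) + tone


syllables = "wǒ men yī qǐ qù pá shān bā".split()
print(" ".join(to_numbered(s) for s in syllables))
```

This works only on input that is already split into syllables, since the digit is appended at the end of each string passed in.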
Got it. Thank you, @bdon. I'm integrating your code into the build process. ^^
Great, let me know if you need any more info or have suggestions, I am giving a presentation on this topic in August https://coscup.org/2020/en/agenda/JHFNBM so I can add to it ^^
My next task is to add an option to use the CC-CEDICT dictionary, in case the moedict license causes problems (it is CC-BY NoDerivatives 3.0).
Wow, looks great! I will attend! ^^
Hi, @bdon , @poi890poi , @hegezolee , @gromsy , @kcwu ,
Thanks to @bdon , I have made a daily beta of the English encoded Garmin Maps.
To download these maps:
Garmin Handheld: https://map.happyman.idv.tw/rudy/drops/gmapsupp_Taiwan_moi_en_bw.img.zip
Windows BaseCamp: https://map.happyman.idv.tw/rudy/drops/Install_MOI_Taiwan_TOPO_camp3D_en.exe
macOS BaseCamp: https://map.happyman.idv.tw/rudy/drops/Taiwan_moi_en_camp3D.gmap.zip
I chose the English name according to the following three priorities:
The result looks good. Please help me check whether there are any problems, and tell me if any improvement is needed.
Thanks Rudy 7/3
Hi @alpha-rudy,
It's perfect. Thank you so much for your hard work.
OK. Thanks! ^^
I will release it to the FB group for more testing. If there are no big issues, I will include it in the weekly release. ^^
Hi Rudy,
Is there a way to compile this so that places are displayed in pinyin and not Traditional Chinese? My European-bought Fenix 6 can't display Chinese characters, and the generated gmapsupp only has question marks.
Thanks in advance.