Korean vs Chinese detection

GoogleCodeExporter commented 8 years ago

When trying to detect the language with the following input:

"保險公司條例草案月底首讀 
政府建議成立獨立保險業監管局"

it returns Korean instead of Chinese as the detection result, with > 0.99999 
confidence.

This also happens with most other Chinese (zh-tw) texts, although sometimes 
zh-tw gets listed with about 0.15 confidence.

By removing the Korean profile, zh-tw correctly becomes the detection result 
with > 0.9999 confidence.

This seems odd, since the Korean profile is completely different from the given 
input string.

Original issue reported on code.google.com by andreas....@oximity.com on 16 Apr 2014 at 1:53

GoogleCodeExporter commented 8 years ago

I'm seeing the same issue with the string " 之前為 帳單交易作業區 
已變更 廣告內容 之前為 銷售代表 之前為 張貼日期為 
百分比之前為 合約 為 目標對象條件已刪除 
結束日期之前為" from the cld2 test suite.

To me this is odd because the script is different.
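
For anyone who wants to confirm the script claim, here is a minimal sketch that counts characters by script using only Python's standard `unicodedata` module. It does not call language-detection or cld2 and does not explain the misdetection; the helper name `script_counts` and the shortened sample are just for illustration.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count non-whitespace characters as Han, Hangul, or Other (hypothetical helper)."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        # Unicode character names start with the script for these blocks,
        # e.g. "CJK UNIFIED IDEOGRAPH-4E4B" or "HANGUL SYLLABLE HAN".
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
            counts["Han"] += 1
        elif name.startswith("HANGUL"):
            counts["Hangul"] += 1
        else:
            counts["Other"] += 1
    return counts


# Shortened excerpt of the string quoted above.
sample = "之前為 帳單交易作業區 已變更 廣告內容"
print(script_counts(sample))
```

If the Hangul count is zero, as expected for this sample, the input itself contains no Korean-script characters, so whatever causes the Korean result would have to come from the profiles or the detector rather than from the script of the text.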

Original comment by skr...@deezer.com on 12 Jun 2014 at 10:09

GoogleCodeExporter commented 8 years ago

Encountering the same thing with some Chinese text with optimaize's 
language-detector on github...

IDK, maybe encoding?
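
One rough way to rule encoding in or out before the text reaches the detector is the sketch below. It assumes the most common failure mode, UTF-8 bytes that were decoded as Latin-1 somewhere upstream, and the helper name `looks_like_mojibake` is hypothetical, not part of any detection library.

```python
def looks_like_mojibake(text):
    """Heuristic: True if text looks like UTF-8 bytes mis-decoded as Latin-1."""
    # U+FFFD is the replacement character inserted by lossy decoders.
    if "\ufffd" in text:
        return True
    try:
        # If every character fits in Latin-1 and the resulting bytes form
        # valid UTF-8, the text was probably decoded with the wrong codec.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    # Pure ASCII round-trips trivially, so require a non-ASCII character.
    return any(ord(ch) > 0x7F for ch in text)


good = "保險公司條例草案月底首讀"
# Simulate the suspected corruption: UTF-8 bytes read back as Latin-1.
broken = good.encode("utf-8").decode("latin-1")

print(looks_like_mojibake(good))
print(looks_like_mojibake(broken))
```

If the check comes back clean for the text actually passed to the detector, encoding is probably not the cause.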

Original comment by dennis97...@gmail.com on 22 Jul 2015 at 1:01

aheadhim0207 / language-detection

Korean vs Chinese detection #66