7468696e6b / kangxiDictText

Converted kangxi dictionary to text file, for use in Pleco and other dictionary apps. (re-uploaded 2020-06-29)

Slight formatting inconsistencies and error at Line 11791 (歹) #1

Open BSBMteam opened 4 years ago

BSBMteam commented 4 years ago

Wonderful project~ I'm in the process of gathering data from this database!

The entry for 歹 has an enumeration comma (、) inserted after nearly every character, including after punctuation marks. This may be an error. See below:

歹 《康熙字典》〈辰集下〉【歹字部】頁578第15 〔古文〕:𡰮、。、同、𣦵、,、俗、省、。、本、作、𣦵、,、隷、作、歺、。、>【俗書正誤】歹,音遏。【長箋】今誤讀等在切,爲好字之反。𣦵字原从卜从冂作。
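The corruption quoted in the 歹 entry has a recognizable signature: a punctuation mark such as 。 or , immediately followed by 、. The sketch below flags lines containing that pattern. It is only a heuristic, and it assumes that this sequence does not occur in well-formed entries; `find_suspect_lines` and `scan_file` are hypothetical helper names, not part of the repository.

```python
import re

# Assumption: in a well-formed entry, 、 separates variant characters and
# never directly follows another punctuation mark (。 , or ;).
SUSPECT = re.compile(r"[。,;]、")


def find_suspect_lines(lines):
    """Return 1-based line numbers whose text matches the corruption pattern."""
    return [n for n, line in enumerate(lines, start=1) if SUSPECT.search(line)]


def scan_file(path):
    """Scan a UTF-8 dictionary text file and return suspect line numbers."""
    with open(path, encoding="utf-8") as handle:
        return find_suspect_lines(handle)
```

Calling `scan_file("kangxizidian-v3f.txt")` would list candidate lines for manual review against the scanned pages; a hit is a hint, not proof of corruption.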

Some thick right brackets (】) are followed by a space. These spaces should be deleted for consistency. In fact, all spaces should be deleted. For example, all titles with spaces in between, such as 【禮 曲禮】 〈補遺 巳集〉, should be formatted with a · like so: 【禮·曲禮】 〈補遺·巳集〉
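A minimal sketch of the normalization requested here, assuming titles are never nested and that only ordinary and ideographic spaces need handling. It uses a replacement function so that spaces inside 【…】 or 〈…〉 become ·, while spaces after a closing 】 are dropped. Tabs are left alone, since the file appears to use them to separate the headword from the entry. `normalize_titles` is a hypothetical name.

```python
import re

# A title enclosed in 【…】 or 〈…〉; nested brackets are assumed not to occur.
TITLE = re.compile(r"([【〈])([^】〉]*)([】〉])")
# Ordinary space and ideographic space (U+3000); tabs are deliberately excluded.
SPACES = r"[ \u3000]+"


def normalize_titles(line: str) -> str:
    """Replace spaces inside titles with · and remove spaces after 】."""

    def fix(match):
        inner = match.group(2).strip(" \u3000")
        return match.group(1) + re.sub(SPACES, "·", inner) + match.group(3)

    line = TITLE.sub(fix, line)
    return re.sub("】" + SPACES, "】", line)


print(normalize_titles("【禮 曲禮】 〈補遺 巳集〉"))
```

Because the second substitution removes every space after 】, the two titles end up adjacent; drop that step if the separating space should be kept.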

7468696e6b commented 4 years ago

Thanks for pointing that out. The file was converted from a StarDict file using regex (original StarDict file here, not my work: https://simonwiles.net/projects/kangxi-zidian/). Looking at the original file, it seems the database itself was corrupted. You can view the original scanned data here: http://www.kangxizidian.com/kangxi/0578.gif. You can try alternate databases such as:

Good luck!

If you want to fork or contribute you're also welcome to do so :)

7468696e6b commented 4 years ago

For now I've manually edited and re-uploaded the file as kangxizidian-v3f.txt, correcting the error you've pointed out. Let me know if you find any other errors! Thank you for your help.

7468696e6b commented 4 years ago

I'll look into the formatting issue, but keep in mind that the file is designed for use in Pleco. I think it would be a little more difficult to add the · character with a simple regex. Hope the other databases are also of use to your research!

BSBMteam commented 4 years ago

Thanks! I'll look into that

I've also noticed that many instances of this character are an unknown character: ?

A quick Notepad++ search returns 11 results (a duplicate means there are 2 in that entry):

劉 | 劉 | 恖 | 恖 | 洭 | 洭 | 謧 | 贔 | 贔 | 𥒯 | 𧦮

7468696e6b commented 4 years ago

Could you clarify what you mean by the duplicates or the unknown character? I only see one result when I search for the headword (using 2 tab characters after the character 劉). I get this result:

劉       《康熙字典》〈子集下〉【刀字部】頁144第39    〔古文〕:鎦、𠭱【唐韻】【集韻】【韻會】【正韻】𠀤力求切,音留。【說文】殺也。【書·盤庚】重我民,無盡劉。【詩·周頌】勝殷遏劉。【左傳·成十三年】䖍劉我邊陲。又【爾雅·釋詁】劉,𨻰也。【疏】謂敷𨻰也。又【爾雅·釋詁】劉,㬥樂也。【疏】木枝葉稀疎不均爲㬥樂。【詩·大雅】捋采其劉。【毛傳】劉,爆爍而希也。又【爾雅·釋木】劉,劉杙。【註】劉子生山中。【疏】劉一名劉杙,其子可食。又姓。【韻會】凡二十五望,𠀤自陶唐氏劉累之後。又【集韻】力九切,留上聲。好也。又【集韻】龍珠切,音鏤。殺也。漢禮,立秋有貙劉。又【同文備考】作??。
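The entry quoted above ends in 作??, so one way to reproduce the earlier search is to count placeholder characters per entry rather than per match. The sketch below assumes one entry per line with the headword before the first tab, as the two-tab search suggests, and treats both `?` and U+FFFD as placeholders. `entries_with_placeholder` is a hypothetical name.

```python
def entries_with_placeholder(lines, placeholders=("?", "\ufffd")):
    """Return (headword, count) for each entry containing placeholder characters."""
    hits = []
    for line in lines:
        # Assumption: the headword is everything before the first tab.
        headword = line.split("\t", 1)[0]
        count = sum(line.count(mark) for mark in placeholders)
        if count:
            hits.append((headword, count))
    return hits


sample = [
    "劉\t\t《康熙字典》〈子集下〉【刀字部】又【同文備考】作??。",
    "歹\t\t《康熙字典》〈辰集下〉【歹字部】歹,音遏。",
]
print(entries_with_placeholder(sample))
```

An entry with a count of 2 would show up twice in a plain text search, which may be what the duplicated headwords in the list refer to.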

Also, if you're on desktop, try installing the fonts (see the README.md file). On Windows you can usually drag them into C:\Windows\Fonts, or simply double-click the downloaded TTF files and press Install. This might help with displaying unknown characters. If not, feel free to let me know, as it might be an error in the database.

7468696e6b commented 4 years ago

These are the fonts: