Open frankier opened 6 years ago
Whoops! Looks like I just forgot to read the documentation properly.
@frankier If there's anything that could improve the docs, it's welcome! It's good to make the docs as helpful as possible
I think the OP was referring to the Unihan documentation for the zVariant field. I came here to file the same issue, but for the kSemanticVariant field. The unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.
I actually think that this issue should be re-opened, but with the goal of parsing these fields instead of just outputting the raw strings provided by Unihan.
For now, I have a post-processing step to make the field easier to use:
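for d in unihan:
    if sem_variants := d.get("kSemanticVariant", []):
        new_sem_variant = []
        for s in sem_variants:
            # "U+6377<kHanYu:TZ" -> "6377": drop everything from "<" on, then the "U+" prefix
            codepoint = s.split("<")[0][2:]
            char = chr(int(codepoint, 16))
            new_sem_variant.append(char)
        # Replace the raw strings with plain characters
        d["kSemanticVariant"] = new_sem_variant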
@garfieldnate I appreciate the code sample!
I'd like unihan-etl to handle this scenario better, but I'm not fully sure what's happening.
I came here to file the same issue, but for the kSemanticVariant field. The unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.
This still sounds a bit ambiguous to me, since we don't host the UNIHAN website or have any affiliation with them (though I'm aware of it, of course!). I haven't played with UNIHAN in a year or two, so I'm a bit fuzzy!
unihan-etl downloads a dump of UNIHAN data and extracts from it. So in that regard, can you reframe this in the context of the raw data dump, and how unihan-etl can help your use case as a user? I'd appreciate it even more! 😊
The unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.
(For general purposes in helping me understand the situation:)
By "The unihan website only displays", what display is this referring to? The table with info on kSemanticVariant?
The page when looking up a character? e.g.
https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E7%B0%A1
Thanks for the quick reply! I find this project super useful, so I'm grateful for the work you put into it.
This sounds a bit ambiguous to me still, since we don't host the UNIHAN website or have affiliation with them
The page when looking up a character?
Yep, that page. I know you're not associated with them; I was just trying to explain how the misunderstanding happened in my case. The unihan.json file can be cumbersome to search all the time, so I regularly use the Unihan website. I thought that should match the downloaded file contents, but I was wrong.
In terms of how to update Unihan-ETL for this use-case, I would normalize/structure the output so that one string field generally contains one kind of data. For example, you've parsed the kIRGKangXi field into a dictionary with several fields (page, character, virtual) from a string that looks like 0929.300, etc.
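To make the kIRGKangXi example concrete, here is a minimal sketch of that kind of split. It is not unihan-etl's actual implementation; the function name parse_kangxi and the regular expression are illustrative, and the assumed layout (4-digit page, a dot, 2-digit character position, then a 0/1 virtual flag) should be checked against the Unihan documentation.

```python
import re

# Assumed layout of a kIRGKangXi value: 4-digit page, a dot, a 2-digit
# character position, then a single 0/1 "virtual" flag.
KANGXI_RE = re.compile(r"^(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>[01])$")


def parse_kangxi(value: str) -> dict:
    """Split a value such as '0929.300' into named integer fields (hypothetical helper)."""
    match = KANGXI_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected kIRGKangXi value: {value!r}")
    return {name: int(text) for name, text in match.groupdict().items()}


print(parse_kangxi("0929.300"))  # {'page': 929, 'character': 30, 'virtual': 0}
```

Raising on anything that doesn't match keeps unparsed values from slipping through silently.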
I know it's a lot of work to parse out all of the fields, but it is confusing to have to figure out which ones are parsed and which ones are not, so a note or warning about what is parsed and what is not would be useful. Although I think it would be better to parse all of them if possible :)
Here's the kSemanticVariant example from the Unihan documentation:
As an example, U+3A17 has the kSemanticVariant value "U+6377<kHanYu:TZ". This means that, according to the Hanyu Da Zidian, U+3A17 and U+6377 have identical meaning and that U+6377 is the preferred form.
One way to encode this example would be:
[{
    "捷": [{
        "source": "kHanYu",
        "flags": "TZ"
    }]
}]
(I find it convenient to have the character there directly instead of the U+XXXX string :) .)
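A rough sketch of a parser that would produce that shape for a single value follows. The helper names are made up for illustration, and the grammar it assumes (an optional < followed by comma-separated sources, each with an optional :FLAGS suffix) is inferred from the example above rather than taken from unihan-etl.

```python
def ucn_to_char(ucn: str) -> str:
    """Convert a 'U+XXXX' string into the character it names."""
    return chr(int(ucn[2:], 16))


def parse_semantic_variant(value: str) -> dict:
    """Parse one entry such as 'U+6377<kHanYu:TZ' (hypothetical helper).

    Assumes an optional '<' followed by comma-separated sources, each with
    an optional ':FLAGS' suffix.
    """
    ucn, _, sources_text = value.partition("<")
    sources = []
    # filter(None, ...) skips the empty string left when there is no '<' part
    for item in filter(None, sources_text.split(",")):
        source, _, flags = item.partition(":")
        sources.append({"source": source, "flags": flags})
    return {ucn_to_char(ucn): sources}


print(parse_semantic_variant("U+6377<kHanYu:TZ"))
# {'捷': [{'source': 'kHanYu', 'flags': 'TZ'}]}
```

A value with no < part yields an empty source list, and a source without flags gets an empty flags string, so the output shape stays uniform.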
The other fields that I've been manually parsing today are kSemanticVariant, kZVariant, kSimplifiedVariant, kTraditionalVariant, kJoyoKanji and kJinmeiyoKanji. But there are probably more to be found.
@garfieldnate This was extraordinarily helpful! Thank you kindly!
I will take a closer look at improving parsing around kSemanticVariant.
You are very welcome to make a pull request in the meantime, of course.
(I find it convenient to have the character there directly instead of the U+XXXX string :) .)
That is very interesting. Perhaps there's a future opportunity to add a flag to return the Unicode glyph (e.g. 捷) instead of the U+XXXX symbol.
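As a sketch of what such a flag might do, the snippet below swaps every U+XXXX reference in a value for its glyph and leaves the rest of the string alone. The render function and its as_glyph parameter are hypothetical; unihan-etl has no such option as far as this thread shows.

```python
import re

# Matches references such as U+6377 or U+2A6D6 (4 to 6 hex digits).
UCN_RE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def render(value: str, as_glyph: bool = False) -> str:
    """Return value unchanged, or with U+XXXX references replaced by glyphs.

    `as_glyph` is a hypothetical flag used only for illustration.
    """
    if not as_glyph:
        return value
    return UCN_RE.sub(lambda m: chr(int(m.group(1), 16)), value)


print(render("U+6377<kHanYu:TZ"))                 # U+6377<kHanYu:TZ
print(render("U+6377<kHanYu:TZ", as_glyph=True))  # 捷<kHanYu:TZ
```

Because it works on the raw string, it could apply to any field that embeds codepoints, whether or not that field has been parsed yet.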
@garfieldnate In the meantime I am also pinning this so I don't lose track of it 😊 I juggle quite a few projects at once.
No hurry, as I already have workarounds for what I need :) The flag for returning characters instead of codepoints also sounds pretty nice! I won't open an issue for now, as I think this is mostly related to these as-yet-unparsed fields.
Hi,
I have found an instance of bad data in the database. I guess there could be more. Should the UniHan data be automatically cleaned before importing?