cihai / unihan-etl

Export UNIHAN's database to CSV, JSON, or YAML
https://unihan-etl.git-pull.com
MIT License
52 stars 13 forks

Clean bad data? #80

Open frankier opened 6 years ago

frankier commented 6 years ago

Hi,

I have found an instance of bad data in the database. I guess there could be more. Should the UniHan data be automatically cleaned before importing?

from cihai.core import Cihai
from cihai.bootstrap import bootstrap_unihan

cihan = Cihai()
if not cihan.is_bootstrapped:
    bootstrap_unihan(cihan.metadata)

cihan.reflect_db()

c = cihan.lookup_char('任').first()
print(c.kZVariant)

$ python bad_data.py
U+4EFC<kHKGlyph
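
The raw value above packs two things into one string: the variant's codepoint and a tag naming the source that attests it, separated by `<`. A minimal sketch of splitting them apart (hypothetical helper, not part of cihai):

```python
def split_variant(raw):
    """Split a raw Unihan variant value into (codepoint, source tags)."""
    codepoint, _, tag = raw.partition("<")
    sources = tag.split(",") if tag else []  # a value may carry several tags
    return codepoint, sources

print(split_variant("U+4EFC<kHKGlyph"))
# ('U+4EFC', ['kHKGlyph'])
```
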
frankier commented 6 years ago

Whoops! Looks like I just forgot to read the documentation properly.

tony commented 6 years ago

@frankier If there's anything that could improve the docs, it's welcome! It's good to make the docs as helpful as possible.

garfieldnate commented 3 years ago

I think the OP was referring to the Unihan documentation for the kZVariant field. I came here to file the same issue, but for the kSemanticVariant field. The Unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.

I actually think that this issue should be re-opened, but with the goal of parsing these fields instead of just outputting the raw strings provided by Unihan.

garfieldnate commented 3 years ago

For now, I have a post-processing step to make the field easier to use:

# Post-process the exported records so kSemanticVariant holds plain
# characters instead of raw "U+XXXX<source" strings (requires Python 3.8+
# for the walrus operator).
for d in unihan:
    if sem_variants := d.get("kSemanticVariant", []):
        new_sem_variants = []
        for s in sem_variants:
            codepoint = s.split("<")[0][2:]  # drop the source tag and "U+" prefix
            char = chr(int(codepoint, 16))
            new_sem_variants.append(char)
        d["kSemanticVariant"] = new_sem_variants

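
For reference, the codepoint/character conversion the loop relies on is a one-liner in each direction (generic Python, independent of unihan-etl):

```python
def ucn_to_char(ucn):
    """'U+6377' -> '捷': drop the 'U+' prefix and decode the hex codepoint."""
    return chr(int(ucn[2:], 16))

def char_to_ucn(char):
    """'捷' -> 'U+6377': format the ordinal back as a U+XXXX string."""
    return "U+%04X" % ord(char)

print(ucn_to_char("U+6377"))  # 捷
print(char_to_ucn("捷"))      # U+6377
```
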
tony commented 3 years ago

@garfieldnate

I appreciate the code sample!

I'd like to make unihan-etl better support this scenario, but I'm not fully sure what's happening.

I came here to file the same issue, but for the kSemanticVariant field. The unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.

This sounds a bit ambiguous to me still, since we don't host the UNIHAN website or have affiliation with them (though I'm aware of it, of course!). I haven't played with UNIHAN in a year or two, so I'm a bit fuzzy!

unihan-etl downloads a dump of UNIHAN data and extracts from it. So in that regard, can you reframe this in the context of the raw data dump, and how unihan-etl can help your use case as a user? I'd appreciate it even more! 😊

The unihan website only displays the codepoint and character for this field, ignoring everything after the <, so it's confusing to then find these strings in the JSON/YAML files.

(For general purposes in helping me understand the situation:)

By "The unihan website only displays", what display is this referring to? The table with info on kSemanticVariant?


The page when looking up a character? e.g.

https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E7%B0%A1

garfieldnate commented 3 years ago

Thanks for the quick reply! I find this project super useful, so I'm grateful for the work you put into it.

This sounds a bit ambiguous to me still, since we don't host the UNIHAN website or have affiliation with them

The page when looking up a character?

Yep, that page. I know you're not associated with them; I was just trying to explain how the misunderstanding happened in my case. The unihan.json file can be cumbersome to search all the time, so I regularly use the Unihan website. I thought that should match the downloaded file contents, but I was wrong.

In terms of how to update Unihan-ETL for this use-case, I would normalize/structure the output so that one string field generally contains one kind of data. For example, you've parsed the kIRGKangXi field into a dictionary with several fields (page, character, virtual) from a string that looks like 0929.300, etc.
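
For reference, that kIRGKangXi string packs all three fields into one token; a minimal sketch of the parse (a hypothetical helper assuming the documented page.position+virtual layout, not unihan-etl's actual code):

```python
def parse_kangxi(raw):
    """Parse a kIRGKangXi value like '0929.300' into its three fields."""
    page, rest = raw.split(".")
    return {
        "page": int(page),         # KangXi dictionary page number
        "char": int(rest[:-1]),    # character position on that page
        "virtual": int(rest[-1]),  # 1 if the entry is virtual (not in the print edition)
    }

print(parse_kangxi("0929.300"))
# {'page': 929, 'char': 30, 'virtual': 0}
```
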

I know it's a lot of work to parse out all of the fields, but it is confusing to have to figure out which ones are parsed and which ones are not, so a note or warning about what is parsed and what is not would be useful. Although I think it would be better to parse all of them if possible :)

Here's the kSemanticVariant example from the Unihan documentation:

As an example, U+3A17 has the kSemanticVariant value "U+6377<kHanYu:TZ". This means that, according to the Hanyu Da Zidian, U+3A17 and U+6377 have identical meaning and that U+6377 is the preferred form.

One way to encode this example would be:

[{
    "捷": [{
        "source": "kHanYu",
        "flags": "TZ"
    }]
}]

(I find it convenient to have the character there directly instead of the U+XXXX string :) .)
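
A sketch of a parser (hypothetical helper, not part of unihan-etl) that turns the raw value from that example into roughly this shape:

```python
def to_structured(raw):
    """Turn 'U+6377<kHanYu:TZ' into the nested structure proposed above."""
    codepoint, _, tail = raw.partition("<")
    char = chr(int(codepoint[2:], 16))  # 'U+6377' -> '捷'
    entries = []
    for part in tail.split(",") if tail else []:  # a value may cite several sources
        source, _, flags = part.partition(":")
        entries.append({"source": source, "flags": flags})
    return {char: entries}

print(to_structured("U+6377<kHanYu:TZ"))
# {'捷': [{'source': 'kHanYu', 'flags': 'TZ'}]}
```
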

The other fields that I've been manually parsing today are kSemanticVariant, kZVariant, kSimplifiedVariant, kTraditionalVariant, kJoyoKanji and kJinmeiyoKanji. But there are probably more to be found.

tony commented 3 years ago

@garfieldnate This was extraordinarily helpful! Thank you kindly!

I will take a closer look at improving parsing around kSemanticVariant

You are very welcome to make a pull request in the meantime, of course.

(I find it convenient to have the character there directly instead of the U+XXXX string :) .)

That is very interesting. Perhaps a possible future opportunity to add a flag to return the Unicode glyph instead of the U+XXXX symbol.

tony commented 3 years ago

@garfieldnate In the meantime I am also pinning this so I don't lose track of it 😊 I juggle quite a few projects at once

garfieldnate commented 3 years ago

No hurry, as I already have workarounds for what I need :) The flag for returning characters instead of codepoints also sounds pretty nice! I won't open an issue for now, as I think this is mostly related to these as-yet-unparsed fields.