cihai / unihan-etl

Export UNIHAN's database to csv, json or yaml
https://unihan-etl.git-pull.com
MIT License

option to save as a dictionary instead of a list #233

Open garfieldnate opened 3 years ago

garfieldnate commented 3 years ago

I have found that I always need to convert the data into a dictionary (instead of the default list) when I'm using it. Because of this, I decided to always store the file in dictionary format. My method for doing so is a bit hacky, and it would be great to have a --structure <dict|list> or even --dictionary parameter to do this within unihan_etl.

Here's my current code. It relies on the undocumented "python" value for the format option, which makes the packager return the data instead of writing a file:

from unihan_etl.process import Packager as unihan_packager
from unihan_etl.process import export_json

def unihan_download(unihan_file):
    # destination argument is required even though the packager will not write the file
    p = unihan_packager.from_cli(["-F", "json", "--destination", unihan_file])
    p.download()
    # instruct packager to return data instead of writing to file
    p.options["format"] = "python"
    unihan = p.export()

    # convert from list to dictionary
    unihan_dict = {entry["char"]: entry for entry in unihan}

    export_json(unihan_dict, unihan_file)

tony commented 2 years ago

@garfieldnate I missed this message! Sorry about that!

Is there anything I can do at this time? Looks like you have stuff going on here https://github.com/garfieldnate/uniunihan-db

garfieldnate commented 2 years ago

Thanks for noticing :D I obviously have a workaround already, but I do still think that a --dictionary option would make unihan-etl more useful. No worries if you can't get to it, as my workaround is fine for me. Thanks for the great library!

tony commented 2 years ago

@garfieldnate We can add it, and also make it available via Python API
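For illustration only, here is a minimal sketch of what such an option could do on the Python side. The helper name `restructure` and its `structure` and `key` parameters are hypothetical and not part of unihan_etl; the only thing taken from the thread is that each exported record is a dict with a "char" key, as in the workaround above. The sample records are made up.

```python
def restructure(records, structure="list", key="char"):
    """Return records as a list (default) or as a dict keyed by `key`.

    Hypothetical helper: unihan_etl does not ship this function.
    """
    if structure == "list":
        return list(records)
    if structure == "dict":
        # Later records overwrite earlier ones that share the same key.
        return {record[key]: record for record in records}
    raise ValueError(f"unknown structure: {structure!r}")


# Made-up sample data in the same shape as the exported list.
sample = [
    {"char": "一", "kDefinition": "one"},
    {"char": "二", "kDefinition": "two"},
]

as_dict = restructure(sample, structure="dict")
print(as_dict["二"]["kDefinition"])
```

Keeping the list as the default would leave existing callers unaffected, and a dictionary keyed by character gives direct lookup without a second pass over the data.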

garfieldnate commented 9 months ago

With the most recent unihan_etl, the code I pasted above fails with the error below. I'm not sure whether my usage of the API is wrong or whether there's an issue in the library.

.venv/lib/python3.11/site-packages/unihan_etl/process.py:531: in export
    data = expand_delimiters(data)
.venv/lib/python3.11/site-packages/unihan_etl/process.py:406: in expand_delimiters
    char[field] = expansion.expand_field(field, char[field])
.venv/lib/python3.11/site-packages/unihan_etl/expansion.py:416: in expand_field
    return expansion_func(fvalue)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = [{'radical': 5, 'simplified': False, 'strokes': 10}, "213''.0"]

    def _expand_kRSGeneric(value):
        pattern = re.compile(
            r"""
            (?P<radical>[1-9][0-9]{0,2})
            (?P<simplified>\'?)\.
            (?P<strokes>-?[0-9]{1,2})
        """,
            re.X,
        )

        for i, v in enumerate(value):
>           m = pattern.match(v).groupdict()
E           AttributeError: 'NoneType' object has no attribute 'groupdict'

.venv/lib/python3.11/site-packages/unihan_etl/expansion.py:332: AttributeError

tony commented 9 months ago

@garfieldnate Thank you!

Does wiping cache and the DB file and rerunning change anything?

garfieldnate commented 9 months ago

That was a really fast response :D

This is actually my bad; the latest unihan_etl already has a fix for this in place, and I mistakenly thought I had updated.

The issue is a typo in the kRSUnicode field for 亀: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E4%BA%80. It has two apostrophes, which does not follow the syntax specified in the standard. unihan_etl has already updated its parsing to allow the second apostrophe.
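To make the failure concrete, here is a small sketch that reproduces it outside the library. `STRICT` copies the pattern from the traceback above; `RELAXED` is only my assumption of what a more tolerant pattern could look like, not necessarily the change unihan_etl actually made. The helper `parse_rs` is hypothetical as well, and treating any apostrophe as "simplified" is a simplification for the example.

```python
import re

# Pattern as shown in the traceback: at most one apostrophe is allowed.
STRICT = re.compile(
    r"""
    (?P<radical>[1-9][0-9]{0,2})
    (?P<simplified>'?)\.
    (?P<strokes>-?[0-9]{1,2})
    """,
    re.X,
)

# Assumed relaxation: accept up to two apostrophes before the dot.
RELAXED = re.compile(
    r"""
    (?P<radical>[1-9][0-9]{0,2})
    (?P<simplified>'{0,2})\.
    (?P<strokes>-?[0-9]{1,2})
    """,
    re.X,
)


def parse_rs(value, pattern):
    """Parse one radical-stroke value; return None when it does not match."""
    m = pattern.match(value)
    if m is None:
        # The traceback comes from calling .groupdict() on this None.
        return None
    groups = m.groupdict()
    return {
        "radical": int(groups["radical"]),
        "simplified": groups["simplified"] != "",
        "strokes": int(groups["strokes"]),
    }


print(parse_rs("213''.0", STRICT))
print(parse_rs("213''.0", RELAXED))
```

Checking the match for None before calling groupdict() also turns an unexpected value into something the caller can handle, rather than an AttributeError deep inside the expansion step.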

I did have to update my code for some unihan_etl changes, but nothing crazy.

tony commented 9 months ago

@garfieldnate Thank you for the added information. I created a separate issue so that anyone who runs into this error knows that updating fixes it!