ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Updated scripts to extract zh snac tokens #312

Closed by xinyixuu 48 minutes ago

xinyixuu commented 1 day ago

This PR uploads two files:

snac_text_zh.py runs the program that extracts the SNAC tokens.

get_zh_snac.sh downloads the dataset from Hugging Face and stores all the JSON output in "json_outs".

The following command runs the whole process: bash get_zh_snac.sh
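Once the script has finished, it can be useful to confirm that the output files are readable. The following is a minimal sketch, not part of this PR: it assumes only that the outputs are `.json` files directly inside "json_outs", and the helper name `summarize_json_outputs` is hypothetical. It makes no assumption about the structure of the JSON itself.

```python
import json
from pathlib import Path


def summarize_json_outputs(out_dir="json_outs"):
    """Return (valid, invalid) lists of JSON file names found in out_dir.

    Hypothetical helper: only checks that each *.json file parses,
    not that it contains any particular fields.
    """
    valid, invalid = [], []
    for path in sorted(Path(out_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
            valid.append(path.name)
        except (json.JSONDecodeError, UnicodeDecodeError):
            invalid.append(path.name)
    return valid, invalid


if __name__ == "__main__":
    ok, bad = summarize_json_outputs()
    print(f"{len(ok)} readable JSON file(s), {len(bad)} unreadable")
```

An empty or missing directory yields two empty lists, which usually means the download step did not run (for example, because authentication failed).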

Note: The dataset we use (https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) requires authentication. Before running the bash script, enter your Hugging Face token on line 13 of get_zh_snac.sh.

P.S.: You can view and create your tokens at https://huggingface.co/settings/tokens. A "Token Type" of "Read" is recommended.
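As an alternative to pasting the token into a file that might later be committed, the token could be read from an environment variable. This is a sketch under assumptions, not how the scripts in this PR work: `read_hf_token` is a hypothetical helper, and `HF_TOKEN` is used because it is the variable name the `huggingface_hub` library reads by default. Whether get_zh_snac.sh or snac_text_zh.py honor that variable is not stated here.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing. Raises RuntimeError if the variable is
    unset or blank.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var_name} to a Hugging Face access token before running the download."
        )
    return token
```

Failing early with a clear message avoids a less obvious authentication error partway through the download.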

gkielian commented 48 minutes ago

Looks good