direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

handle character variants when loading data #7

Open thatbudakguy opened 2 years ago

thatbudakguy commented 2 years ago

see https://github.com/direct-phonology/core/blob/6a800a3201de43c039a6f7f096aef3a65a843922/core/bin/gentable.py#L84-L134

thatbudakguy commented 2 years ago

see also spaCy's own docs on data augmentation, which let you swap in orthographic variants during training to make the model more robust. our existing variant file could be converted into this format pretty easily, and then we could either use the built-in orth_variants augmenter or define our own.
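
a rough sketch of what that conversion might look like, in case it helps. it assumes the variant file can be loaded as a mapping from variant character to canonical character (the sample table here is made up for illustration, not taken from our data), and that the target is the dict with `"single"` and `"paired"` keys that spaCy's `spacy.orth_variants.v1` augmenter reads. `build_orth_variants` and the `"CHAR"` tag are hypothetical; the tags would need to match whatever fine-grained tags our training data actually uses.

```python
import json
from collections import defaultdict


def build_orth_variants(variant_to_canonical, tags):
    """Group variants by canonical form into an orth_variants-style dict.

    variant_to_canonical: mapping of variant character -> canonical character
        (assumed shape of our variant file; adjust to the real format).
    tags: fine-grained tags the swap should apply to (assumption).
    """
    groups = defaultdict(set)
    for variant, canonical in variant_to_canonical.items():
        # Each group holds the canonical form plus all of its variants,
        # since the augmenter swaps freely among members of a group.
        groups[canonical].update((canonical, variant))

    single = [
        {"tags": list(tags), "variants": sorted(chars)}
        for _, chars in sorted(groups.items())
    ]
    # No paired (open/close) variants such as quotation marks in this sketch.
    return {"single": single, "paired": []}


# Hypothetical sample data, standing in for the real variant file.
variant_table = {"説": "說", "说": "說", "爲": "為"}
orth_variants = build_orth_variants(variant_table, tags=["CHAR"])

# Serialize as JSON so a training config could point the augmenter at the file.
orth_variants_json = json.dumps(orth_variants, ensure_ascii=False, indent=2)
print(orth_variants_json)
```

the built-in augmenter swaps whole tokens, so this only works as-is if each character ends up as its own token. if that doesn't hold for our pipeline, writing our own augmenter is probably the better route.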