cmu-llab / wikihan

Creative Commons Zero v1.0 Universal
Error in entry for 更 #2

Open bahducoup opened 1 year ago

bahducoup commented 1 year ago

更 kæŋ¹ kaːŋ˥ - kaŋ˨˦ t͡ɕĩŋ˩˩ kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ kĩ˥/kẽ˥ - -

When this row is split on '\t', kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ is treated as a single token. The tʰimɤ sənsɤ portion of this token seems to be erroneously included in the row.

I think it would be a good idea to check why these characters were included in the dataset and verify that there are no similar errors in the rest of the dataset.
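One way to look for similar rows is to scan every tab-separated cell for internal whitespace, since a single pronunciation should not contain spaces. This is only a rough sketch: `find_suspicious_cells` is a hypothetical helper, not something from the repo, and the sample row is reconstructed from the row above with the tab positions assumed.

```python
def find_suspicious_cells(lines):
    """Return (line_number, column_index, cell) for every cell with internal whitespace.

    Assumption: rows are tab-separated and a well-formed cell never contains
    spaces, so any whitespace inside a cell points at a parsing leftover.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        cells = line.rstrip("\n").split("\t")
        for col, cell in enumerate(cells):
            # strip() first so leading/trailing padding alone is not flagged
            if any(ch.isspace() for ch in cell.strip()):
                hits.append((line_no, col, cell))
    return hits


# Reconstructed sample row; the tab positions are an assumption.
sample = [
    "更\tkæŋ¹\tkaːŋ˥\t-\tkaŋ˨˦\tt͡ɕĩŋ˩˩\tkɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ\tkĩ˥/kẽ˥\t-\t-\n",
]

for line_no, col, cell in find_suspicious_cells(sample):
    print(f"line {line_no}, column {col}: {cell!r}")
```

Running this over the whole file would list every cell worth checking by hand; it would not catch annotations that happen to contain no spaces.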

kalvinchang commented 1 year ago

this shouldn't affect the reconstruction cuz in the dataloader, we take the first pronunciation variant (kɤŋ˥ in this case)

thanks for catching this tho!!
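For illustration, taking the first variant could look roughly like the sketch below. This is not the actual dataloader code; `first_variant` is a hypothetical name, and it assumes variants are separated by `/`.

```python
def first_variant(cell, sep="/"):
    """Return the first pronunciation variant of a cell.

    Assumption: variants within a cell are separated by `sep`.
    """
    return cell.split(sep)[0].strip()


# The stray annotation sits in the third variant, so it is never selected.
print(first_variant("kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ"))
```

With this selection rule the stray text only matters if it ends up in the first variant of a cell.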

kalvinchang commented 1 year ago

the issue is that the romanized version (the pre-parsed version on Wiktionary) shows something like this "gēng/jīng/the time sense" for Mandarin

we need to remove extra annotations for Mandarin in the Wiktionary parsing script
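A possible shape for that fix, as a sketch only: split the romanized string on `/` and drop any variant that contains spaces, assuming a reading for a single character is one syllable with no internal whitespace. `strip_annotations` is a hypothetical helper and not part of the existing parsing script.

```python
def strip_annotations(romanized, sep="/"):
    """Keep only variants that look like a single-syllable reading.

    Assumption: a valid reading for one character has no internal whitespace,
    so variants such as "the time sense" are treated as annotations.
    """
    variants = [v.strip() for v in romanized.split(sep)]
    return [v for v in variants if v and not any(ch.isspace() for ch in v)]


print(strip_annotations("gēng/jīng/the time sense"))
```

An annotation made of a single word would pass this filter, so a stricter check against valid pinyin syllables may be needed.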