Open bahducoup opened 1 year ago
This shouldn't affect the reconstruction because in the dataloader we take the first pronunciation variant (kɤŋ˥ in this case).
Thanks for catching this though!
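To illustrate why the extra text does not reach the model, here is a minimal sketch of taking the first variant from a cell. The function name `first_variant` and the `/` separator are assumptions for illustration; the actual dataloader code may differ.

```python
def first_variant(cell: str, variant_sep: str = "/") -> str:
    """Return the first pronunciation variant of a cell (hypothetical helper)."""
    # Everything after the first separator is ignored, including any
    # stray annotation that was parsed as an extra "variant".
    return cell.split(variant_sep)[0].strip()


mandarin_cell = "kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ"
selected = first_variant(mandarin_cell)
```

With this selection rule the malformed third variant is never read, although it still sits in the dataset file.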
The issue is that the romanized version (the pre-parsed version on Wiktionary) shows something like "gēng/jīng/the time sense" for Mandarin.
We need to remove extra annotations for Mandarin in the Wiktionary parsing script.
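One possible way to drop such annotations is sketched below. It is a heuristic, not the parsing script's actual logic: the name `clean_romanization` is hypothetical, and it assumes each valid variant is a single pinyin syllable made only of letters (tone marks included), so anything containing spaces or punctuation is discarded.

```python
import unicodedata


def clean_romanization(raw: str, variant_sep: str = "/") -> str:
    """Keep only variants that look like a single romanized syllable."""
    kept = []
    for part in raw.split(variant_sep):
        # NFC composes base letters with tone diacritics so that
        # str.isalpha() treats "ē" as one alphabetic character.
        part = unicodedata.normalize("NFC", part.strip())
        # Free-text notes such as "the time sense" contain spaces,
        # so isalpha() is False and they are dropped.
        if part and part.isalpha():
            kept.append(part)
    return variant_sep.join(kept)


cleaned = clean_romanization("gēng/jīng/the time sense")
```

This rule would also drop legitimate readings that contain an apostrophe or a space, so it only fits single-character entries like the one here.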
更 kæŋ¹ kaːŋ˥ - kaŋ˨˦ t͡ɕĩŋ˩˩ kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ kĩ˥/kẽ˥ - -
When this row is split on '\t', kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ is treated as a single token. The tʰimɤ sənsɤ portion of this token seems to be erroneously included in the row.

I think it would be a good idea to check why these characters were included in the dataset and verify that there are no similar errors in the rest of the dataset.
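A check along these lines could be sketched as follows. It assumes rows are tab-separated, variants within a cell are separated by `/`, and a valid variant never contains whitespace; the function name `find_suspicious_variants` is hypothetical, and the sample row is shortened for illustration.

```python
from typing import Iterable, Iterator, Tuple


def find_suspicious_variants(
    rows: Iterable[str], sep: str = "\t", variant_sep: str = "/"
) -> Iterator[Tuple[int, int, str]]:
    """Yield (row index, column index, variant) for variants containing whitespace."""
    for row_idx, row in enumerate(rows):
        cells = row.rstrip("\n").split(sep)
        for col_idx, cell in enumerate(cells):
            for variant in cell.split(variant_sep):
                variant = variant.strip()
                # Internal whitespace suggests free text leaked into the cell.
                if any(ch.isspace() for ch in variant):
                    yield row_idx, col_idx, variant


sample_rows = ["更\tkɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ\tkĩ˥/kẽ˥\t-"]
hits = list(find_suspicious_variants(sample_rows))
```

Running this over every row of the dataset file would list candidate cells for manual review. A header row with multi-word column names would also be flagged, so it may need to be skipped.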