Open ayaka14732 opened 1 year ago
Others have been complaining about this for a long time: https://twitter.com/chaakming/status/1555246138105614336
I guess nobody in the FLORES team knows Cantonese and Mandarin well enough to understand the unique situation of this language. The current data collected for yue is Hong Kong Chinese, NOT Cantonese. We recommend using this classifier to filter the real Cantonese data https://github.com/CanCLID/cantonese-classifier
The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is completely wrong. The data is not Cantonese at all, but rather Mandarin Chinese in Traditional Chinese script (`zho_Hant`), which has only stylistic differences compared to the `zho_Hant` data in the dataset.

Furthermore, the paper mentioned that the `yue_Hant` and `zho_Hant` data tend to be predicted as each other. It turns out that both datasets actually consist of `zho_Hant` data exclusively. `yue_Hant` and `zho_Hant` should actually be very easy to distinguish from each other.

Here is what correct `yue_Hant` data would look like:

| `eng_Latn` | `zho_Hant` | `yue_Hant` (wrong) | `yue_Hant` (corrected) |
| --- | --- | --- | --- |

(Bold denotes words that are used exclusively in `yue_Hant`)
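
As a rough illustration of why `yue_Hant` and `zho_Hant` should be easy to tell apart, here is a minimal sketch of a character-based heuristic. It is not the CanCLID classifier and does not use FLORES-200 data: the marker sets, the function name `guess_variety`, and the two example sentences are all assumptions chosen for illustration. It only counts characters that are typical of written Cantonese versus characters typical of written Mandarin.

```python
# Hypothetical marker sets, chosen for illustration only (not exhaustive,
# and not taken from FLORES-200 or from the CanCLID classifier).
CANTONESE_MARKERS = set("嘅咗喺唔佢哋冇嘢咁啲嚟睇")
MANDARIN_MARKERS = set("的是這那他們沒")


def guess_variety(sentence: str) -> str:
    """Guess a language code by counting variety-specific characters."""
    yue = sum(ch in CANTONESE_MARKERS for ch in sentence)
    cmn = sum(ch in MANDARIN_MARKERS for ch in sentence)
    if yue > cmn:
        return "yue_Hant"
    if cmn > yue:
        return "zho_Hant"
    # No markers found, or a tie: the heuristic cannot decide.
    return "undetermined"


if __name__ == "__main__":
    # Made-up example sentences, both roughly "They are eating."
    print(guess_variety("佢哋喺度食緊飯"))  # written Cantonese
    print(guess_variety("他們正在吃飯"))    # written Mandarin
```

A heuristic this crude will misjudge short or mixed sentences. For actually filtering data, the classifier linked above is the better choice.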