The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is not Cantonese at all

The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is completely wrong. It is not Cantonese at all, but Mandarin Chinese in Traditional Chinese script (`zho_Hant`), and it differs only stylistically from the `zho_Hant` data already in the dataset.

Furthermore, the paper mentions that the `yue_Hant` and `zho_Hant` data tend to be predicted as each other. It turns out this is because both datasets consist exclusively of `zho_Hant` data. Genuine `yue_Hant` and `zho_Hant` text should be very easy to distinguish from each other.

Here is what correct `yue_Hant` data would look like:

| Language Code | Sentence |
| --- | --- |
| `eng_Latn` | They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else. |
| `zho_Hant` | 他們發現太陽的運作與其他恆星的基本原理相同：系統中所有恆星的活動均受其光度、自轉所推動，就是這麼簡單。 |
| `yue_Hant` (wrong) | 他們發現，太陽和其他恆星的運行原理是一樣的：系統中所有恆星的活動都是由它們的亮度、自轉驅動的，而並非其他因素。 |
| `yue_Hant` (corrected) | 佢哋發現，太陽同其他恆星嘅運行原理冇分別：系統入面所有恆星嘅活動都淨係由佢哋嘅亮度同自轉推動，而唔包括其他因素。 |

(Bold denotes words that are used exclusively in yue_Hant)
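To illustrate how easy the two should be to tell apart, here is a minimal sketch that counts characters typical of written Cantonese in the three sentences above. The marker set and the function names are my own assumptions for illustration, not part of FLORES-200 or any library, and a real check would need a proper language-ID model rather than a character list.

```python
# Hypothetical marker set: characters common in written Cantonese that
# rarely appear in standard written Mandarin. Illustrative, not exhaustive.
CANTONESE_MARKERS = frozenset("佢哋嘅冇唔咗喺嘢啲")


def count_cantonese_markers(text: str) -> int:
    """Count characters in `text` that belong to the marker set."""
    return sum(1 for ch in text if ch in CANTONESE_MARKERS)


def looks_cantonese(text: str) -> bool:
    """Crude heuristic: at least one marker character is present."""
    return count_cantonese_markers(text) > 0


# The three Chinese sentences quoted in this issue.
SAMPLES = {
    "zho_Hant": "他們發現太陽的運作與其他恆星的基本原理相同：系統中所有恆星的活動均受其光度、自轉所推動，就是這麼簡單。",
    "yue_Hant (wrong)": "他們發現，太陽和其他恆星的運行原理是一樣的：系統中所有恆星的活動都是由它們的亮度、自轉驅動的，而並非其他因素。",
    "yue_Hant (corrected)": "佢哋發現，太陽同其他恆星嘅運行原理冇分別：系統入面所有恆星嘅活動都淨係由佢哋嘅亮度同自轉推動，而唔包括其他因素。",
}

if __name__ == "__main__":
    for label, sentence in SAMPLES.items():
        print(label, count_cantonese_markers(sentence), looks_cantonese(sentence))
```

On these samples, only the corrected sentence contains any marker characters; the current `yue_Hant` sentence contains none, just like the `zho_Hant` one.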

facebookresearch/flores, issue #61