facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Topics #39

Open jaspock opened 2 years ago

According to the FLORES-101 paper, "we manually labeled all sentences by a more detailed sub-topic, one of 10 possibilities: crime, disasters, entertainment, geography, health, nature, politics, science, sports, and travel". Table 1 in the paper gives the statistics for these sub-topics. However, the metadata files contain a much larger number of sub-topics (306, in fact), such as:

Accident
accidents
accordion/right hand
advanced interactive media
Alchohol
American education/forgotten half/Foster care
American education/Special Needs ADD
...
ancient china/government
Ancient Civilizations/Romans
Ancient_Civilizations/Assyrians
...
big cats
big cats, lion
big cats, ocelot
big cats, tiger
Blended Learning/Blogging
Blended Learning/Field trips
Bugs/Insects_Intro
business
castles of england/tudor castles
castles of english/development of castles
climate
...

Is the 10-class metadata available for download, or are there any recommendations on how to group the existing sub-topics into a smaller number of topics?

The list of the 306 topics can easily be obtained with:

cat metadata_dev* | cut -f 3 | sort | uniq
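
As a rough first step toward grouping, the raw labels could be collapsed by case, underscores, and everything after the first "/" or ",". The sketch below is only an illustration: `coarse_topics` is a hypothetical helper name, and it assumes the files are tab-separated with the sub-topic in column 3, as in the command above. It does not recover the 10 classes from the paper; it only merges obvious variants such as "big cats, lion" and "big cats".

```shell
# Hypothetical helper (not part of the FLORES repository).
# Assumption: tab-separated input files with the sub-topic in column 3.
# A header line, if present, is counted like any other label.
coarse_topics() {
  awk -F'\t' '
    {
      t = tolower($3)            # "Accident" and "accidents" share a case
      gsub(/_/, " ", t)          # "Ancient_Civilizations" -> "ancient civilizations"
      sub("[/,].*", "", t)       # keep only the part before the first "/" or ","
      gsub(/^ +| +$/, "", t)     # trim surrounding spaces
      if (t != "") count[t]++
    }
    END {
      for (k in count) printf "%d\t%s\n", count[k], k
    }
  ' "$@" | sort -t "$(printf '\t')" -k1,1nr -k2,2   # most frequent keys first
}
```

Calling it on the same files as the command above prints one line per merged key, preceded by its count. The remaining keys would still have to be mapped by hand to the 10 sub-topics named in the paper.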