The UDHR-LID dataset is a refined version of the Universal Declaration of Human Rights (UDHR), tailored for language identification tasks. It removes filler text, repeated phrases, and inaccuracies from the original UDHR and keeps only the cleaned paragraphs. Each entry is associated with a specific language and provides long, linguistically rich content. The dataset is particularly useful for non-parallel, language-specific text analysis in natural language processing.
Dataloader name: udhr_lid/udhr_lid.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?udhr_lid
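
The cleaning described above (dropping filler and repeated text, keeping only substantive paragraphs) can be illustrated with a minimal sketch. This is not the procedure used to build UDHR-LID. The function name `clean_paragraphs`, the `min_chars` threshold, the `(language, text)` row layout, and the language codes are all assumptions made for illustration, and the sample rows are typed in by hand rather than read from the dataset.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (language code, paragraph text): a hypothetical layout


def clean_paragraphs(rows: Iterable[Row], min_chars: int = 20) -> List[Row]:
    """Drop short filler lines and repeated paragraphs per language.

    min_chars is an arbitrary illustrative threshold, not a value
    taken from the UDHR-LID dataset.
    """
    seen = set()
    cleaned: List[Row] = []
    for lang, text in rows:
        # Collapse runs of whitespace so near-identical copies compare equal.
        normalized = " ".join(text.split())
        # Treat very short lines (headings such as "Article 1") as filler.
        if len(normalized) < min_chars:
            continue
        # Skip paragraphs already seen for this language, ignoring case.
        key = (lang, normalized.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((lang, normalized))
    return cleaned


# Hand-written sample rows for illustration only.
sample = [
    ("eng", "All human beings are born free and equal in dignity and rights."),
    ("eng", "All human beings are born   free and equal in dignity and rights."),
    ("eng", "Article 1"),
    ("ind", "Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama."),
]

result = clean_paragraphs(sample)
for lang, text in result:
    print(lang, text)
```

In this sketch the duplicate English paragraph and the short heading are dropped, leaving one labeled paragraph per language. A real pipeline would likely need language-aware rules, since a fixed character threshold behaves differently across scripts.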