bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset OSIAN #281

Open: albertvillanova opened this issue 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

cakiki commented 2 years ago

@albertvillanova: It seems the data was never available for download (see this snapshot from last January: https://web.archive.org/web/20210125033548/http://oujda-nlp-team.net/en/corpora/osian-corpus/).

apergo-ai commented 2 years ago

self-assign

apergo-ai commented 2 years ago

The data is available here: https://clarin.informatik.uni-leipzig.de/de?corpusId=ara-international_newscrawl-OSIAN_2018

But it is part of a much larger corpus, which I will try to get into the project.

albertvillanova commented 2 years ago

This needs to be clarified: is this dataset actually accessible, and which files correspond to it?