bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset yoruba_dialogues_in_different_domains #106

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
tosingithub commented 2 years ago

self-assign

albertvillanova commented 2 years ago

Hi @tosingithub, please note that this issue is not part of the first phase of the datasets hackathon. We are currently working with Collections, which you can find in the Collections tab: https://github.com/orgs/bigscience-workshop/projects/2/views/7

tosingithub commented 2 years ago

Oh I see. Ok

albertvillanova commented 2 years ago

The data file is a PDF document containing text in English and Yoruba.

I guess the task here is to "parse" the PDF content and extract only the dialogues in Yoruba.
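
A rough sketch of what that filtering step could look like, assuming the PDF text has already been extracted to a plain string by some PDF library (not shown here). The heuristic is only an assumption, not something specified for this dataset: a line is treated as Yoruba if it carries the diacritics of standard Yoruba orthography (tone marks, or the dot below in ẹ, ọ, ṣ). The names `looks_yoruba` and `extract_yoruba_lines` are hypothetical.

```python
import unicodedata

# Combining marks used in standard Yoruba orthography (assumed heuristic):
# U+0300 grave (low tone), U+0301 acute (high tone), U+0323 dot below (ẹ, ọ, ṣ).
YORUBA_MARKS = {"\u0300", "\u0301", "\u0323"}


def looks_yoruba(line: str, min_marks: int = 1) -> bool:
    """Heuristic: True if the line carries at least `min_marks` Yoruba diacritics."""
    # NFD splits precomposed letters (e.g. ẹ) into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", line)
    return sum(ch in YORUBA_MARKS for ch in decomposed) >= min_marks


def extract_yoruba_lines(text: str) -> list[str]:
    """Keep the non-empty lines of `text` that look like Yoruba."""
    stripped = (ln.strip() for ln in text.splitlines())
    return [ln for ln in stripped if ln and looks_yoruba(ln)]


# Made-up sample text standing in for the extracted PDF content.
sample = "Good morning, how are you?\n\u1eb8 k\u00e1\u00e0\u00e1r\u1ecd\u0300, b\u00e1wo ni?\n\nThank you.\n\u1eb8 \u1e63\u00e9."
for line in extract_yoruba_lines(sample):
    print(line)
```

This would miss Yoruba lines written without diacritics and could wrongly keep English lines containing accented loanwords, so the output would still need manual review. Raising `min_marks` trades recall for precision.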

Please note that there is another Yoruba dataset on the Hub (for machine translation to English): https://huggingface.co/datasets/menyo20k_mt

CC: @yjernite