Open albertvillanova opened 2 years ago
Hi @tosingithub, please note that this issue is not part of the first phase of the datasets hackathon: we are currently working with Collections, which you can find in the Collections tab: https://github.com/orgs/bigscience-workshop/projects/2/views/7
Oh I see. Ok
The data file is a PDF document containing text in English and Yoruba.
I guess the task here is to "parse" the PDF content and extract only the dialogues in Yoruba.
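To make the "parse and extract" idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the PDF text can be pulled out with `pypdf`, and that a line is likely Yoruba if it carries several Yoruba-style diacritics (dot below, grave or acute tone marks). The function names, the `min_marks` threshold, and the heuristic itself are hypothetical; the actual layout of the PDF and how dialogues are marked in it are unknown here.

```python
import unicodedata

# Assumption: combining dot below (as in ẹ, ọ, ṣ) and grave/acute tone marks
# are used as a rough signal of Yoruba text. This is a heuristic, not a
# language identifier, and accented English loanwords can trigger it.
YORUBA_MARKS = {"\u0323", "\u0300", "\u0301"}


def looks_yoruba(line: str, min_marks: int = 2) -> bool:
    """Return True if the line has at least `min_marks` Yoruba-style diacritics."""
    # NFD splits precomposed letters into base letter + combining marks,
    # so both composed and decomposed input are counted the same way.
    decomposed = unicodedata.normalize("NFD", line)
    marks = sum(1 for ch in decomposed if ch in YORUBA_MARKS)
    return marks >= min_marks


def extract_yoruba_lines(text: str) -> list[str]:
    """Keep only the non-empty lines that pass the heuristic."""
    return [ln.strip() for ln in text.splitlines() if ln.strip() and looks_yoruba(ln)]


def pdf_to_text(path: str) -> str:
    """Extract raw text from every page of a PDF (requires the `pypdf` package)."""
    from pypdf import PdfReader  # imported here so the filter works without pypdf

    reader = PdfReader(path)
    # extract_text() may return an empty result for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Usage would be along the lines of `extract_yoruba_lines(pdf_to_text("data.pdf"))`, where `data.pdf` is a placeholder path. Restricting the result to dialogues only would need an extra rule based on how the PDF formats them.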
Please note that there is another Yoruba dataset on the Hub (for machine translation to English): https://huggingface.co/datasets/menyo20k_mt
CC: @yjernite