albertvillanova commented 2 years ago

uid: yoruba_dialogues_in_different_domains
type: processed
description:
- name: Yoruba dialogues in different domains
- description: Yoruba dialogues in different domains curated from two sources: https://coerll.utexas.edu/yemi/pdfs/YorubaYeMi-textbook.pdf and https://www.theyorubablog.com/bibe-eko-wo-fun-ose-kan-one-week-visit-in-lagos
- homepage: https://coerll.utexas.edu/yemi/pdfs/YorubaYeMi-textbook.pdf
- validated: True
languages:
- language_names:
- Niger-Congo
- Yoruba
- language_comments:
- language_locations:
- World-Wide
- Nigeria
- United States of America
- validated: False
custodian:
- name: Fehintola Mosadomi
- in_catalogue:
- type: A private individual
- location: United States of America
- contact_name: Fehintola Mosadomi
- contact_email:
- contact_submitter: False
- additional: https://coerll.utexas.edu/yemi/pdfs/YorubaYeMi-textbook.pdf
- validated: False
availability:
- procurement:
- for_download: Yes - it has a direct download link or links
- download_url: https://coerll.utexas.edu/yemi/pdfs/YorubaYeMi-textbook.pdf
- download_email:
- licensing:
- has_licenses: Yes
- license_text:
- license_properties:
  - open license
- license_list:
  - cc-by-3.0-us: Creative Commons Attribution 3.0 United States
- pii:
- has_pii: Yes - text author name only
- generic_pii_likely:
- generic_pii_list:
- numeric_pii_likely:
- numeric_pii_list:
- sensitive_pii_likely:
- sensitive_pii_list:
- no_pii_justification_class: fictional text
- no_pii_justification_text:
- validated: False
processed_from_primary:
- from_primary: Original data
- primary_availability:
- primary_license:
- primary_types:
- validated: False
media:
- category:
- text
- text_format:
- .PDF
- .HTML
- audiovisual_format:
- image_format:
- database_format:
- text_is_transcribed: No
- instance_type: book
- instance_count: 100<n<1K
- instance_size: 100<n<10,000
- validated: False
fname: yoruba_dialogues_in_different_domains.json

tosingithub commented 2 years ago

self-assign

albertvillanova commented 2 years ago

Hi @tosingithub, please note that this issue is not part of the first phase of the datasets hackathon: we are currently working on Collections, which you can find in the Collections tab: https://github.com/orgs/bigscience-workshop/projects/2/views/7

tosingithub commented 2 years ago

Oh I see. Ok

albertvillanova commented 2 years ago

The data file is a PDF document containing text in English and Yoruba.

I guess the task here is to "parse" the PDF content and extract only the dialogues in Yoruba.
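A minimal sketch of the filtering step, assuming the PDF text has already been extracted to a plain string by some PDF tool (that part is not shown). The idea is a rough heuristic, not a real language identifier: Yoruba orthography uses tone marks and under-dotted letters (ẹ, ọ, ṣ) that plain English text lacks, so lines with a high share of those marks are kept. The names `yoruba_score` and `extract_yoruba_lines` and the `0.1` threshold are hypothetical and would need tuning against the actual textbook.

```python
import unicodedata

# Combining marks common in Yoruba orthography: grave and acute (tone marks)
# and dot below (as in ẹ, ọ, ṣ). Assumption: their density separates Yoruba
# lines from English ones well enough for a first pass.
YORUBA_COMBINING = {"\u0300", "\u0301", "\u0323"}


def yoruba_score(line: str) -> float:
    """Return the number of Yoruba-style combining marks per base letter."""
    # NFD splits precomposed characters (e.g. "ẹ") into base letter + marks.
    decomposed = unicodedata.normalize("NFD", line)
    letters = sum(1 for ch in decomposed if ch.isalpha())
    if letters == 0:
        return 0.0
    marks = sum(1 for ch in decomposed if ch in YORUBA_COMBINING)
    return marks / letters


def extract_yoruba_lines(text: str, threshold: float = 0.1) -> list[str]:
    """Keep non-empty lines whose mark density reaches the (hypothetical) threshold."""
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and yoruba_score(line) >= threshold:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = "Good morning, how are you?\nẸ káàárọ̀, ṣé dáadáa ni?"
    for line in extract_yoruba_lines(sample):
        print(line)
```

This would miss Yoruba lines written without diacritics and could keep short English lines containing accented loanwords, so the output would still need a manual check.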

Please note that there is another Yoruba dataset on the Hub (for machine translation to English): https://huggingface.co/datasets/menyo20k_mt

Maybe it is worth adding as well?

CC: @yjernite

bigscience-workshop / data_tooling

Create dataset yoruba_dialogues_in_different_domains #106