PMA0012: Compile Scraped and CTA Creative Writing Data

OpenPecha / compile-writing-data

MIT License

PMA0012: Compile Scraped and CTA Creative Writing Data #2

Open 10kalden opened 1 month ago

10kalden commented 1 month ago

Description: We have scraped Tibetan creative writing data and also have data from CTA. We need to compile both into a single format with the necessary metadata, such as source, type, and date.

Subtasks:

[x] Download the news data from S3
[x] Reformat the JSON structure of the CTA and scraped data to suit a Hugging Face dataset
[x] Explore text cleaning and normalization
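
The reformatting and cleaning subtasks above could look roughly like the sketch below. It is only an illustration: the input field names (`content`, `category`, `published`) and the helpers `clean_text` and `to_record` are hypothetical, since the actual CTA and scraped schemas are not shown in this issue. The cleaning step is limited to Unicode NFC normalization and whitespace collapsing, which is an assumption about what "normalization" means here.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize to Unicode NFC and collapse runs of whitespace to one space."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def to_record(raw: dict, source: str) -> dict:
    """Map one raw item onto a shared schema: text plus source, type, and date.

    The input keys ("content", "category", "published") are hypothetical
    placeholders for whatever the CTA and scraped files actually use.
    """
    return {
        "text": clean_text(raw.get("content", "")),
        "source": source,
        "type": raw.get("category", "unknown"),
        "date": raw.get("published"),  # None when the raw item has no date
    }


# Example with a made-up raw item
example = to_record(
    {"content": "  first line \n second line ", "category": "poem"},
    source="scraped",
)
```

Keeping the per-source mapping in one small function means the CTA and scraped loaders can differ while the output schema stays identical.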

Completion Criteria: Obtain a compiled JSON file containing the text and its metadata
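
One possible way to check this criterion is to merge the per-source record lists and refuse any record that lacks a metadata field before serializing. The sketch below assumes a hypothetical schema with the keys `text`, `source`, `type`, and `date`; the function name `compile_records` and the output file name in the comment are illustrative, not part of the repository.

```python
import json

# Hypothetical schema; adjust to the fields the compiled file really uses.
REQUIRED_KEYS = ("text", "source", "type", "date")


def compile_records(record_groups) -> str:
    """Merge several lists of records into one JSON array string.

    Raises ValueError if any record is missing a required key.
    """
    compiled = []
    for group in record_groups:
        for record in group:
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(f"record missing keys: {missing}")
            compiled.append(record)
    # ensure_ascii=False keeps Tibetan script readable instead of \u escapes
    return json.dumps(compiled, ensure_ascii=False, indent=2)


# Writing the result to disk would then be a single call, for example:
# pathlib.Path("compiled.json").write_text(compile_records(groups), encoding="utf-8")
```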

10kalden commented 1 month ago

Uploaded the creative writing data to S3
Data size: 1.78 GB
s3://creative.writing.data