Description:
We have scraped Tibetan creative writing data and also have data from CTA. We need to compile both into a single format with the necessary metadata, such as source, type, and date.
Subtask:
[x] Download the news data from s3
[x] Reformat the JSON structure of the CTA and scraped data to suit a Hugging Face dataset.
[x] Explore text cleaning and normalization
Completion Criteria:
Obtain a compiled JSON file containing the text and its metadata.
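As a rough illustration of the compiled format, here is a minimal sketch that merges the two sources into flat records with `text`, `source`, `type`, and `date` fields and writes them as JSON Lines. The field names, the input keys (`text`, `type`, `date`, `source`), the default labels (`news`, `creative_writing`), and the helper names are all assumptions for illustration and do not reflect the actual CTA or scraped schemas. The whitespace collapsing is only a placeholder for the cleaning step.

```python
import json


def to_record(text, source, doc_type, date):
    """Build one flat record. Field names are hypothetical."""
    return {
        # Placeholder cleaning: collapse runs of whitespace (including newlines).
        "text": " ".join(text.split()),
        "source": source,
        "type": doc_type,
        "date": date,
    }


def compile_records(cta_items, scraped_items):
    """Merge both sources into one list of records with shared keys."""
    records = []
    for item in cta_items:
        # Assumed: CTA items are dicts with "text" and optional "type"/"date".
        records.append(
            to_record(item["text"], "CTA", item.get("type", "news"), item.get("date"))
        )
    for item in scraped_items:
        # Assumed: scraped items may carry their own "source" label.
        records.append(
            to_record(
                item["text"],
                item.get("source", "scraped"),
                item.get("type", "creative_writing"),
                item.get("date"),
            )
        )
    return records


def write_jsonl(records, path):
    """Write one JSON object per line, keeping Tibetan text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A file written this way should be loadable with the Hugging Face `datasets` JSON loader, since every record shares the same keys. Missing dates are stored as `null` rather than guessed.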