OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
21 stars 10 forks source link

[Super AI] Build Main Pipeline for all data #146

Open new5558 opened 1 year ago

new5558 commented 1 year ago

This task consists of the following parts:

  1. Convert each dataset to JSONL with the following format:
{
    "text": "Data",
    "source": "Source of the data",
    "source_id": "id of the original item in source data",
    "created_date": "Created date",
    "updated_date": "Updated date",
    "meta": "Things that are specific to the source data, e.g. file name, Pantip tags"
}

ex.

{
    "text": "Hello\nWorld",
    "source": "PANTIP",
    "source_id": "234233",
    "created_date": "2011-10-05T14:48:00.000Z",
    "updated_date": "2011-10-05T14:48:00.000Z",
    "meta": "{\"filename\": \"234.jsonl\"}"
}

The data we should convert:

Note: Cleaned OSCAR, mC4, and Pantip can be discussed with @Chawak
Note 2: Cleaned Pantip can be discussed with @boss-chanon
Note 3: Dates should be in ISO string format
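A minimal conversion sketch for step 1, in pure Python. The helper names and the fallback of using the current UTC time when no original date is available are assumptions for illustration, not the agreed implementation:

```python
import json
from datetime import datetime, timezone


def to_record(text, source, source_id, created=None, updated=None, meta=None):
    """Build one record in the JSONL schema above.

    When `created`/`updated` are missing, fall back to the current UTC
    time (hypothetical choice -- the thread says to use the date the
    data was obtained when the original date is unavailable).
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "text": text,
        "source": source,
        "source_id": str(source_id),
        "created_date": created or now,
        "updated_date": updated or now,
        # meta is stored as a JSON-encoded string, as in the example above
        "meta": json.dumps(meta or {}),
    }


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```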

  2. Write a script that merges all JSONL files (except The Pile) into a single Hugging Face dataset with the metadata preserved, e.g. https://huggingface.co/datasets/EleutherAI/pile

Note: We should split the data into train and test sets
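A stdlib-only sketch of the merge-and-split step (in practice this would likely go through `datasets.load_dataset("json", ...)`; the file-name convention for excluding The Pile and the split fraction here are assumptions):

```python
import glob
import json
import random


def merge_jsonl(pattern, exclude=("the_pile",), test_fraction=0.01, seed=42):
    """Merge JSONL shards matching `pattern` into train/test splits.

    Shards whose filename mentions an excluded source (e.g. The Pile)
    are skipped. The split is shuffled deterministically via `seed`.
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        if any(name in path for name in exclude):
            continue
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    rng = random.Random(seed)
    rng.shuffle(records)
    n_test = int(len(records) * test_fraction)
    return {"train": records[n_test:], "test": records[:n_test]}
```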

  3. Write code to merge The Pile and our dataset into a single Hugging Face dataset, with a configurable portion of each dataset to include (e.g. Thai 200%, The Pile 16.6%)
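The proportion-controlled merge could be sketched as follows. This is a pure-Python illustration; the function name, signature, and the reading of portions above 100% as oversampling (repeating examples) are assumptions:

```python
import random


def mix_datasets(datasets, portions, seed=42):
    """Combine named datasets with a per-source portion.

    `portions` maps a source name to the fraction of that dataset to
    include; values above 1.0 oversample by repeating examples (the
    assumed meaning of "Thai 200%" in the task description).
    """
    rng = random.Random(seed)
    mixed = []
    for name, records in datasets.items():
        portion = portions.get(name, 1.0)
        whole, frac = int(portion), portion - int(portion)
        mixed.extend(records * whole)              # full repetitions
        k = int(len(records) * frac)               # fractional remainder
        mixed.extend(rng.sample(records, k))
    rng.shuffle(mixed)
    return mixed
```

For large corpora the same idea is available in the `datasets` library as `interleave_datasets(..., probabilities=...)`, which mixes streams by sampling probability instead of materializing repeated copies.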
Chawak commented 1 year ago

Does created_date mean the date that the original dataset was created krub?

new5558 commented 1 year ago

created_date means the day the original dataset was created, if available; if not, it will be the date the original dataset was obtained (e.g. scraped from ... website on 2023 June 6) @Chawak
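For the ISO string format mentioned in the notes, a small helper can normalize both cases (original creation date vs. acquisition date) into the JavaScript-style format used in the example above. The helper name is hypothetical:

```python
from datetime import datetime, timezone


def to_iso_string(dt):
    """Render a UTC datetime in the JavaScript ISO format used in the
    example record: milliseconds plus a trailing 'Z'."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


# Date the item was originally created, when the source provides it:
created = to_iso_string(datetime(2011, 10, 5, 14, 48, tzinfo=timezone.utc))

# Otherwise, fall back to the time the data was obtained (e.g. scrape time):
obtained = to_iso_string(datetime.now(timezone.utc))
```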