OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
21 stars 10 forks source link

[Super AI] Build Main Pipeline for all data #146

Open new5558 opened 1 year ago

new5558 commented 1 year ago

This task consists of the following parts:

  1. Convert each dataset to JSONL with the following format:
{
    "text": "Data",
    "source": "Source of the data",
    "source_id": "id of the original item in source data",
    "created_date": "Created date",
    "updated_date": "Updated date",
    "meta": "Things that are specific to the source data, e.g. file name, Pantip tags"
}

ex.

{
    "text": "Hello\nWorld",
    "source": "PANTIP",
    "source_id": "234233",
    "created_date": "2011-10-05T14:48:00.000Z",
    "updated_date": "2011-10-05T14:48:00.000Z",
    "meta": "{\"filename\": \"234.jsonl\"}"
}

The data we should convert:

Note: Cleaned OSCAR, mC4, and Pantip can be discussed with @Chawak
Note 2: Cleaned Pantip can be discussed with @boss-chanon
Note 3: Dates should be in ISO string format
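A minimal conversion sketch for step 1, in pure Python. The helper names and the fallback of using the current UTC time when no original date is available are assumptions for illustration, not the agreed implementation:

```python
import json
from datetime import datetime, timezone


def to_record(text, source, source_id, created=None, updated=None, meta=None):
    """Build one record in the JSONL schema above.

    When `created`/`updated` are missing, fall back to the current UTC
    time (hypothetical choice -- the thread says to use the date the
    data was obtained when the original date is unavailable).
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "text": text,
        "source": source,
        "source_id": str(source_id),
        "created_date": created or now,
        "updated_date": updated or now,
        # meta is stored as a JSON-encoded string, as in the example above
        "meta": json.dumps(meta or {}),
    }


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```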

  2. Write a script that merges all JSONL files (except The Pile) into a single Hugging Face dataset with the metadata preserved, e.g. https://huggingface.co/datasets/EleutherAI/pile

Note: We should split the data into train and test sets
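A stdlib-only sketch of the merge-and-split step (in practice this would likely go through `datasets.load_dataset("json", ...)`; the file-name convention for excluding The Pile and the split fraction here are assumptions):

```python
import glob
import json
import random


def merge_jsonl(pattern, exclude=("the_pile",), test_fraction=0.01, seed=42):
    """Merge JSONL shards matching `pattern` into train/test splits.

    Shards whose filename mentions an excluded source (e.g. The Pile)
    are skipped. The split is shuffled deterministically via `seed`.
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        if any(name in path for name in exclude):
            continue
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    rng = random.Random(seed)
    rng.shuffle(records)
    n_test = int(len(records) * test_fraction)
    return {"train": records[n_test:], "test": records[:n_test]}
```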

  3. Write code to merge The Pile and our dataset into a single Hugging Face dataset, with a configurable portion of each dataset to include (e.g. Thai 200%, The Pile 16.6%)
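The proportion-controlled merge could be sketched as follows. This is a pure-Python illustration; the function name, signature, and the reading of portions above 100% as oversampling (repeating examples) are assumptions:

```python
import random


def mix_datasets(datasets, portions, seed=42):
    """Combine named datasets with a per-source portion.

    `portions` maps a source name to the fraction of that dataset to
    include; values above 1.0 oversample by repeating examples (the
    assumed meaning of "Thai 200%" in the task description).
    """
    rng = random.Random(seed)
    mixed = []
    for name, records in datasets.items():
        portion = portions.get(name, 1.0)
        whole, frac = int(portion), portion - int(portion)
        mixed.extend(records * whole)              # full repetitions
        k = int(len(records) * frac)               # fractional remainder
        mixed.extend(rng.sample(records, k))
    rng.shuffle(mixed)
    return mixed
```

For large corpora the same idea is available in the `datasets` library as `interleave_datasets(..., probabilities=...)`, which mixes streams by sampling probability instead of materializing repeated copies.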
Chawak commented 1 year ago

Does created_date mean the date that the original dataset was created krub?

new5558 commented 1 year ago

created_date means the day the original dataset was created, if available; if not, it will be the date the original dataset was obtained (e.g. scraped from ... website on 2023 June 6) @Chawak
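For the ISO string format mentioned in the notes, a small helper can normalize both cases (original creation date vs. acquisition date) into the JavaScript-style format used in the example above. The helper name is hypothetical:

```python
from datetime import datetime, timezone


def to_iso_string(dt):
    """Render a UTC datetime in the JavaScript ISO format used in the
    example record: milliseconds plus a trailing 'Z'."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


# Date the item was originally created, when the source provides it:
created = to_iso_string(datetime(2011, 10, 5, 14, 48, tzinfo=timezone.utc))

# Otherwise, fall back to the time the data was obtained (e.g. scrape time):
obtained = to_iso_string(datetime.now(timezone.utc))
```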