OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

[Super AI] Make pdpa blind data code #174

Closed ArthurMinovsky closed 1 year ago

ArthurMinovsky commented 1 year ago

Before training on the data, we want to blind (mask) people's personal information, especially:

  1. Person Name
  2. Personal Identification Number (Thai national ID card, บัตรประชาชน)

Input:

Output:

Our code should work with the Hugging Face datasets .map function so that it can be easily parallelized on the Lanta node: https://huggingface.co/docs/datasets/process
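As a starting point, here is a minimal sketch of a batched function that could be passed to .map. It only covers item 2 (ID numbers) and assumes a 13-digit number, optionally grouped as 1-4-5-2-1 digits with hyphens or spaces. It does not verify the check digit, so it may over-mask other 13-digit numbers. The names blind_id_numbers, ID_PATTERN and ID_PLACEHOLDER, and the "text" column, are assumptions for illustration and are not part of any existing code in this repo. Masking person names would likely need a Thai NER model and is not attempted here.

```python
import re

# Assumed format: 13 digits, optionally grouped as 1-4-5-2-1 with "-" or " ".
# The lookarounds stop the pattern from matching inside a longer digit run.
ID_PATTERN = re.compile(
    r"(?<!\d)\d(?:[- ]?\d{4})(?:[- ]?\d{5})(?:[- ]?\d{2})(?:[- ]?\d)(?!\d)"
)

# Hypothetical placeholder token; pick whatever the project agrees on.
ID_PLACEHOLDER = "[ID_NUMBER]"


def blind_id_numbers(batch):
    """Replace ID-like numbers in a batch of texts with a placeholder.

    `batch` is a dict of column name -> list of values, which is the
    shape that datasets' .map passes when batched=True.
    """
    return {
        "text": [ID_PATTERN.sub(ID_PLACEHOLDER, text) for text in batch["text"]]
    }


# Possible usage with a loaded dataset (num_proc value is only an example):
# blinded = text_dataset.map(blind_id_numbers, batched=True, num_proc=8)
```

Because the function is pure and works on plain Python lists, it can be unit-tested without loading a dataset, and num_proc can be raised to match the cores available on the node.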

Our potential code candidates

Side Note: Example of how to download the OSCAR dataset from Hugging Face:

from datasets import load_dataset

def download():
    # Download the deduplicated Thai subset of OSCAR and return it to the caller.
    text_dataset = load_dataset("oscar", "unshuffled_deduplicated_th")
    return text_dataset

Please make sure you set the cache directory before downloading: export HF_DATASETS_CACHE="/folder_for_storing_data"
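If exporting the variable in the shell is inconvenient (for example inside a job script), the same setting can probably be applied from Python. This is a sketch, assuming it runs before datasets is imported, since the library reads the variable at import time. CACHE_DIR is just the placeholder path from above.

```python
import os

# Placeholder path from the note above; replace with a real directory.
CACHE_DIR = "/folder_for_storing_data"

# Set this before `import datasets`, otherwise the default cache is used.
os.environ["HF_DATASETS_CACHE"] = CACHE_DIR
```

Alternatively, load_dataset accepts a cache_dir argument if only a single call needs a different location.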