Closed ArthurMinovsky closed 1 year ago
Before training data, we want to blind people's information especially
Input:
Output:
<person> <person_id>
Our code should work with the Huggingface dataset .map function to be easily parallelized on the Lanta node https://huggingface.co/docs/datasets/process
.map
Our potential code candidates
Side Note: Example how to download oscar dataset from huggingface:
from datasets import load_dataset def download(): text_dataset = load_dataset("oscar", "unshuffled_deduplicated_th")
Please make sure you set export HF_DATASETS_CACHE="/folder_for_storing_data"
export HF_DATASETS_CACHE="/folder_for_storing_data"
Before training data, we want to blind people's information especially
Input:
Output:
<person> <person_id>
Our code should work with the Huggingface dataset
.map
function to be easily parallelized on the Lanta node https://huggingface.co/docs/datasets/processOur potential code candidates
Side Note: Example how to download oscar dataset from huggingface:
Please make sure you set
export HF_DATASETS_CACHE="/folder_for_storing_data"