delphi-suite / delphi

small language models training made easy

dataset tokenization script improvements #106

Closed · joshuawe closed this 2 months ago

joshuawe commented 3 months ago

Fixes #105

This fixes the dataset tokenization script, which currently supports only the delphi-suite/stories dataset with its particular parquet file structure. The script should be able to download and tokenize any suitable HF dataset, even one with a slightly different structure (a rough sketch of the intended flow follows below).

Note: Needs to be rebased on #94 once that branch is rebased on main again
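
For orientation, here is a rough sketch of the generalized flow the PR is aiming for: load an arbitrary HF dataset, then tokenize one text column per split. The function name and structure are illustrative, not the script's actual code.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    def tokenize_hf_dataset(dataset_name: str, tokenizer_name: str, column_name: str) -> dict:
        # load_dataset resolves the hub layout (parquet, json, ...) regardless of structure
        dataset = load_dataset(dataset_name)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        tokenized = {}
        for split, data in dataset.items():
            tokenized[split] = [tokenizer(text)["input_ids"] for text in data[column_name]]
        return tokenized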

jettjaniak commented 3 months ago

you should point this PR at 93-tokenize-... instead of main; then, when you merge 93-tokenize-..., this one will automatically update to point at main again

jettjaniak commented 3 months ago

looks like the merges caused the displayed diff to be wrong

[screenshot of the displayed diff]
joshuawe commented 2 months ago

For the reviewers: please run this command once and verify that it uploads your dataset. It worked for me on a subset of the dataset, but my RAM was not sufficient to tokenize the entire dataset in one go. :(

python ./scripts/tokenize_dataset.py --token HF_TOKEN --input-dataset-name delphi-suite/stories --tokenizer-name delphi-suite/stories-tokenizer --output-dataset-name NEW_HF_DATASET_NAME --column-name=story

@jettjaniak @siwei-li

jettjaniak commented 2 months ago

Weird, on my machine it used just ~1 GB of memory

jettjaniak commented 2 months ago

But it's failing with

[1]    87023 killed     ./scripts/tokenize_dataset.py --hf-token hf_cHQmKbyWcgrUxZQAgUWuphVtJvheAGFSB
/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I think one of these calls is at fault

        # Store the tokenized data in a new dataset for this split
        tokenized_datasets[split] = Dataset.from_dict({"tokens": tokenized_dataset})

    # Create a new dataset with the same structure (splits) as the original dataset, but with tokenized data
    output_dataset = DatasetDict(tokenized_datasets)
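
For context, Dataset.from_dict builds the split from a fully materialized Python dict, so the whole tokenized split has to fit in memory first, which would explain the blow-up. Below is a minimal sketch of a lower-memory alternative using Dataset.from_generator, which writes rows to an Arrow cache file as they are produced; iter_tokenized_texts is a hypothetical helper, not part of the script.

    from datasets import Dataset

    def token_rows():
        # hypothetical generator yielding one list of token ids at a time
        for ids in iter_tokenized_texts():
            yield {"tokens": ids}

    # rows are streamed to disk instead of being held in a single in-memory dict
    tokenized_split = Dataset.from_generator(token_rows)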
jettjaniak commented 2 months ago

I added scripts/demo_upload_in_chunks.py as an example of how to upload the dataset in chunks; we should adapt the tokenization script accordingly
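
Roughly, the chunked-upload idea is to push each chunk as its own parquet shard so the full tokenized dataset never sits in memory at once. The sketch below is a hedged illustration, not the contents of demo_upload_in_chunks.py; iter_chunks is a hypothetical iterable of lists of tokenized rows.

    import io
    from datasets import Dataset
    from huggingface_hub import HfApi

    def upload_in_chunks(iter_chunks, repo_id: str, split: str, token: str) -> None:
        api = HfApi(token=token)
        api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
        for i, rows in enumerate(iter_chunks):
            # each chunk becomes one parquet shard in the dataset repo
            shard = Dataset.from_dict({"tokens": rows})
            buf = io.BytesIO()
            shard.to_parquet(buf)
            api.upload_file(
                path_or_fileobj=buf.getvalue(),
                path_in_repo=f"data/{split}-{i:05d}.parquet",
                repo_id=repo_id,
                repo_type="dataset",
            )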

jettjaniak commented 2 months ago

I reduced memory usage, but broke tests (should be easy to fix)

this https://huggingface.co/datasets/delphi-suite/stories-tokenized is the result of running

    scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s SPLIT -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...

with SPLIT={train, validation} (two separate commands)

jettjaniak commented 2 months ago

one of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too