joshuawe closed this PR 2 months ago
You should point this PR at 93-tokenize-... instead of main; then, when you merge 93-tokenize-..., this one will automatically update to point at main again.
looks like the merges caused the displayed diff to be wrong
For the reviewers: please let this command run once and verify that it uploaded your dataset. It worked for me on a subset of the dataset, but my RAM was not sufficient to tokenize the entire dataset in one go. :(
python ./scripts/tokenize_dataset.py --token HF_TOKEN --input-dataset-name delphi-suite/stories --tokenizer-name delphi-suite/stories-tokenizer --output-dataset-name NEW_HF_DATASET_NAME --column-name=story
@jettjaniak @siwei-li
Weird, on my machine it used just ~1 GB of memory
But it's failing with:
[1] 87023 killed ./scripts/tokenize_dataset.py --hf-token hf_cHQmKbyWcgrUxZQAgUWuphVtJvheAGFSB
/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I think one of these calls is at fault:
# Store the tokenized data in a new dataset for this split
tokenized_datasets[split] = Dataset.from_dict({"tokens": tokenized_dataset})
# Create a new dataset with the same structure (splits) as the original dataset, but with tokenized data
output_dataset = DatasetDict(tokenized_datasets)
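For context, one plausible reason these calls exhaust memory: `Dataset.from_dict` materializes every tokenized row in a Python dict before anything is written to disk. A hedged sketch of an alternative (not necessarily what this PR ended up doing; `stories`, the helper, and the example inputs below are illustrative) would stream rows through `datasets`' generator-based constructor, which writes Arrow batches incrementally:

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")
stories = ["Once upon a time...", "The end."]  # stand-in for the real split

def tokenized_rows(stories, tokenizer, max_length=512):
    for story in stories:
        yield {"tokens": tokenizer(story, truncation=True, max_length=max_length)["input_ids"]}

# from_generator streams rows into an on-disk Arrow table instead of
# building one giant in-memory dict first, keeping peak RAM low
tokenized = Dataset.from_generator(
    tokenized_rows, gen_kwargs={"stories": stories, "tokenizer": tokenizer}
)
```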
I added scripts/demo_upload_in_chunks.py as an example of how to upload the dataset in chunks; we should adapt the tokenization script accordingly. A sketch of the idea follows.
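To illustrate (this is my reconstruction, not the actual contents of scripts/demo_upload_in_chunks.py; the function name, shard layout, and chunk size are assumptions): write each chunk as its own parquet shard and push it with `huggingface_hub`, so no step ever holds the whole split at once.

```python
import os

from datasets import Dataset
from huggingface_hub import HfApi

def upload_in_chunks(rows, repo_id, split, chunk_size=10_000, token=None):
    api = HfApi(token=token)
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    for shard, start in enumerate(range(0, len(rows), chunk_size)):
        # materialize only one chunk at a time and write it as a parquet shard
        chunk = Dataset.from_dict({"tokens": rows[start : start + chunk_size]})
        local_path = f"{split}-{shard:05d}.parquet"
        chunk.to_parquet(local_path)
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=f"data/{split}-{shard:05d}.parquet",
            repo_id=repo_id,
            repo_type="dataset",
        )
        os.remove(local_path)  # free local disk between shards
```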
I reduced memory usage, but broke tests (should be easy to fix)
this https://huggingface.co/datasets/delphi-suite/stories-tokenized is the result of
scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s SPLIT -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...
where SPLIT={train, validation}
(two separate commands)
one of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too
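If the fix is just re-pinning the tokenizer, it might look like this (hypothetical test and constant names; the replacement tokenizer isn't named in the thread, so it stays a placeholder):

```python
from transformers import AutoTokenizer

# was "delphi-suite/stories-tokenizer"; point at whichever tokenizer the
# script now uses (placeholder below, since the new name isn't given here)
TOKENIZER_NAME = "NEW_TOKENIZER_NAME"

def test_tokenizer_loads():
    tok = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    ids = tok("Once upon a time")["input_ids"]
    assert len(ids) > 0
```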
Fixes #105
Fixing the tokenize-dataset script: currently only the delphi-suite/stories dataset is supported, with its (unique) structure of parquet files. The script should be able to download any suitable HF dataset, even if it has a slightly different structure.
Note: needs to be rebased on #94 once that branch is rebased on main again.
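As a rough sketch of that generalization (assuming `datasets.load_dataset` can resolve each repo's file layout; the function name and signature here are illustrative, not the script's actual interface):

```python
from datasets import load_dataset

def load_text_column(dataset_name: str, split: str, column: str):
    # load_dataset handles the repo's file layout (parquet, json, csv, ...),
    # so the script no longer needs to assume delphi-suite/stories' structure
    ds = load_dataset(dataset_name, split=split)
    if column not in ds.column_names:
        raise ValueError(f"Column {column!r} not found; available: {ds.column_names}")
    return ds[column]

# e.g. load_text_column("delphi-suite/stories", "train", "story")
```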