delphi-suite / delphi

small language models training made easy

dataset tokenization script improvements #106

Closed · joshuawe closed this 2 months ago

joshuawe commented 3 months ago

Fixes #105

This fixes the dataset tokenization script, which currently supports only the delphi-suite/stories dataset with its particular parquet file structure. The script should be able to download and tokenize any suitable HF dataset, even one with a slightly different structure (a rough sketch of the intended flow follows below).

Note: Needs to be rebased on #94 once that branch is rebased on main again
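
For orientation, here is a rough sketch of the generalized flow the PR is aiming for: load an arbitrary HF dataset, then tokenize one text column per split. The function name and structure are illustrative, not the script's actual code.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    def tokenize_hf_dataset(dataset_name: str, tokenizer_name: str, column_name: str) -> dict:
        # load_dataset resolves the hub layout (parquet, json, ...) regardless of structure
        dataset = load_dataset(dataset_name)
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        tokenized = {}
        for split, data in dataset.items():
            tokenized[split] = [tokenizer(text)["input_ids"] for text in data[column_name]]
        return tokenized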

jettjaniak commented 3 months ago

you should point this PR at 93-tokenize-... instead of main; then, when you merge 93-tokenize-..., this one will automatically update to point at main again

jettjaniak commented 3 months ago

looks like the merges caused the displayed diff to be wrong

[screenshot of the displayed diff]
joshuawe commented 2 months ago

For the reviewers: please run this command once and verify that it uploads your dataset. It worked for me on a subset of the dataset, but my RAM was not sufficient to tokenize the entire dataset in one go. :(

python ./scripts/tokenize_dataset.py --token HF_TOKEN --input-dataset-name delphi-suite/stories --tokenizer-name delphi-suite/stories-tokenizer --output-dataset-name NEW_HF_DATASET_NAME --column-name=story

@jettjaniak @siwei-li

jettjaniak commented 2 months ago

Weird, on my machine it used just ~1 GB of memory

jettjaniak commented 2 months ago

But it's failing with

[1]    87023 killed     ./scripts/tokenize_dataset.py --hf-token hf_cHQmKbyWcgrUxZQAgUWuphVtJvheAGFSB
/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I think one of these calls is at fault

        # Store the tokenized data in a new dataset for this split
        tokenized_datasets[split] = Dataset.from_dict({"tokens": tokenized_dataset})

    # Create a new dataset with the same structure (splits) as the original dataset, but with tokenized data
    output_dataset = DatasetDict(tokenized_datasets)
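
For context, Dataset.from_dict builds the split from a fully materialized Python dict, so the whole tokenized split has to fit in memory first, which would explain the blow-up. Below is a minimal sketch of a lower-memory alternative using Dataset.from_generator, which writes rows to an Arrow cache file as they are produced; iter_tokenized_texts is a hypothetical helper, not part of the script.

    from datasets import Dataset

    def token_rows():
        # hypothetical generator yielding one list of token ids at a time
        for ids in iter_tokenized_texts():
            yield {"tokens": ids}

    # rows are streamed to disk instead of being held in a single in-memory dict
    tokenized_split = Dataset.from_generator(token_rows)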
jettjaniak commented 2 months ago

I added scripts/demo_upload_in_chunks.py as an example of how to upload the dataset in chunks; we should adapt the tokenization script accordingly
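
Roughly, the chunked-upload idea is to push each chunk as its own parquet shard so the full tokenized dataset never sits in memory at once. The sketch below is a hedged illustration, not the contents of demo_upload_in_chunks.py; iter_chunks is a hypothetical iterable of lists of tokenized rows.

    import io
    from datasets import Dataset
    from huggingface_hub import HfApi

    def upload_in_chunks(iter_chunks, repo_id: str, split: str, token: str) -> None:
        api = HfApi(token=token)
        api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
        for i, rows in enumerate(iter_chunks):
            # each chunk becomes one parquet shard in the dataset repo
            shard = Dataset.from_dict({"tokens": rows})
            buf = io.BytesIO()
            shard.to_parquet(buf)
            api.upload_file(
                path_or_fileobj=buf.getvalue(),
                path_in_repo=f"data/{split}-{i:05d}.parquet",
                repo_id=repo_id,
                repo_type="dataset",
            )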

jettjaniak commented 2 months ago

I reduced memory usage, but broke tests (should be easy to fix)

this https://huggingface.co/datasets/delphi-suite/stories-tokenized is the result of running

    scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s SPLIT -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...

with SPLIT={train, validation} (two separate commands)

jettjaniak commented 2 months ago

one of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too