allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Data out of bounds when using ‘dolma tokens --dtype uint32’ #142

Open Jackwaterveg opened 5 months ago

Jackwaterveg commented 5 months ago
[Screenshot: out-of-bounds values in the data]

After running the command

dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20

I use the code below to read the memmap file. The data is out of bounds, as shown above, even though the vocab size is only 150000.

data = MemMapDataset(filePath, chunk_size=2048, memmap_dtype="uint32")
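One way to narrow this down is to bypass the dataset class and inspect the raw file with NumPy. The following is a minimal sketch, not a diagnosis of this issue: it assumes the token file is a flat array of token IDs with no header, and `count_out_of_bounds`, `VOCAB_SIZE`, and the synthetic file are hypothetical names and values introduced only for illustration. It also shows that reading the same bytes with a different integer width than the one used for writing can, by itself, produce IDs far outside the vocabulary.

```python
import os
import tempfile

import numpy as np

# Hypothetical upper bound on valid token IDs (an assumption for this sketch;
# use the real vocabulary size of your tokenizer).
VOCAB_SIZE = 152_000


def count_out_of_bounds(path, dtype, vocab_size):
    """Read a flat token file with `dtype`; return (max_id, n_out_of_bounds)."""
    tokens = np.memmap(path, dtype=dtype, mode="r")
    max_id = int(tokens.max())
    n_bad = int((tokens >= vocab_size).sum())
    return max_id, n_bad


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")

    # Synthetic stand-in for a tokenized file: four little-endian uint32 IDs.
    np.array([0, 1000, 151643, 151646], dtype="<u4").tofile(path)

    # Same width as the writer: every ID is below VOCAB_SIZE.
    matched = count_out_of_bounds(path, "<u4", VOCAB_SIZE)

    # Wider dtype than the writer: pairs of IDs are fused into huge values.
    mismatched = count_out_of_bounds(path, "<u8", VOCAB_SIZE)

print("read as uint32:", matched)
print("read as uint64:", mismatched)
```

If the file passes this check when read with the dtype it was written with, the out-of-bounds values are more likely introduced on the reading side; if it fails, the file itself contains unexpected IDs.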

soldni commented 4 months ago

Thank you for the report, @Jackwaterveg. Could you re-run the command above with --dryrun to show the full configuration? Thanks.