RohitRathore1 closed this issue 7 months ago.
Hi @RohitRathore1! I ran into the same issue, and I think the option should be tokenizer.name_or_path rather than tokenizer_name_or_path. Here is my quick fix:
dolma tokens \
--documents "wikipedia/example0/documents/*.gz" \
--tokenizer.name_or_path "EleutherAI/gpt-neox-20b" \
--tokenizer.bos_token_id 0 \
--destination wikipedia/example0/tokens \
--processes 16
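Since dolma echoes back its resolved configuration as YAML (as seen in the run below), the same options can also be kept in a config file. This is a sketch assuming your dolma version accepts a config file via the -c flag; the keys mirror the CLI flags one-to-one:

```yaml
# config.yaml -- hypothetical config-file equivalent of the flags above
documents:
  - wikipedia/example0/documents/*.gz
tokenizer:
  name_or_path: EleutherAI/gpt-neox-20b
  bos_token_id: 0
destination: wikipedia/example0/tokens
processes: 16
```

which would then be invoked as: dolma -c config.yaml tokens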
Hi @koalazf99, thanks! Yes, you are right: we should use tokenizer.name_or_path, and there are some typos in the documentation. By the way, can you verify this? After running this dolma tokens command I got these results:
dolma tokens \
> --documents "wikipedia/example0/documents/*.gz" \
> --tokenizer.name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
> --tokenizer.bos_token_id 0 \
> --destination wikipedia/example0/tokens \
> --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
  bos_token_id: 0
  eos_token_id: null
  name_or_path: EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
  pad_token_id: null
  segment_before_tokenization: false
  tokenizer_name_or_path: null
work_dir:
  input: null
  output: null
files: 0.00f [00:00, ?f/s] 2024-02-04 09:07:49,914 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,916 WARNING SpawnPoolWorker-13.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,917 WARNING SpawnPoolWorker-16.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,919 WARNING SpawnPoolWorker-3.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,919 WARNING SpawnPoolWorker-5.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-12.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-1.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-11.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-9.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-14.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-6.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,958 WARNING SpawnPoolWorker-2.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,967 WARNING SpawnPoolWorker-10.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,983 WARNING SpawnPoolWorker-4.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,987 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:49,994 WARNING SpawnPoolWorker-8.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:50,030 WARNING SpawnPoolWorker-14.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,032 WARNING SpawnPoolWorker-5.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,034 WARNING SpawnPoolWorker-2.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,034 WARNING SpawnPoolWorker-7.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:50,035 WARNING SpawnPoolWorker-13.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,037 WARNING SpawnPoolWorker-12.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,041 WARNING SpawnPoolWorker-9.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,043 WARNING SpawnPoolWorker-3.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,047 WARNING SpawnPoolWorker-16.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,049 WARNING SpawnPoolWorker-11.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
files: 0.00f [00:00, ?f/s] 2024-02-04 09:07:50,056 WARNING SpawnPoolWorker-1.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,056 WARNING SpawnPoolWorker-10.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,062 WARNING SpawnPoolWorker-4.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,067 WARNING SpawnPoolWorker-6.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,084 WARNING SpawnPoolWorker-8.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,130 WARNING SpawnPoolWorker-7.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
memmaps: 16.0m [00:00, 20.0m/s]
tokens: 0.00t [00:00, ?t/s]
documents: 0.00d [00:00, ?d/s]
files: 1.00f [00:00, 1.25f/s]
Those warnings are expected if you do not provide pad_token_id. You probably want to add --tokenizer.pad_token_id 1 when calling the CLI.
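The fallback those warnings report can be sketched as follows. This is an illustrative reconstruction of the behavior described by the log lines, not dolma's actual code, and resolve_pad_token_id is a hypothetical name:

```python
def resolve_pad_token_id(pad_token_id, eos_token_id):
    """Sketch of the fallback the warnings describe: when pad_token_id
    is not provided, eos_token_id is reused in its place."""
    if pad_token_id is None:
        return eos_token_id
    return pad_token_id

# In the run above both IDs were unset, hence "using EOS token ID None".
print(resolve_pad_token_id(None, None))  # the situation in the log: None
print(resolve_pad_token_id(1, None))     # with --tokenizer.pad_token_id 1: 1
```

Passing an explicit pad token ID avoids the fallback entirely, which is why the warnings disappear once the flag is added.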
I am experiencing an issue while tokenizing the Wikipedia dataset mentioned in the following step. My tokenizer file is in the root of this repository, at the relative path:
dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
The traceback of the error is as follows: