Hi, if oobabooga/llama-tokenizer is specifically taking a long time, you can swap it out with any valid Llama 1/2 tokenizer, e.g. https://huggingface.co/meta-llama/Llama-2-7b-hf
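As a minimal sketch (not the repo's code): you can first check that an alternative tokenizer loads quickly on its own before pointing the script's model_name_or_path argument at it. This assumes you have Hugging Face access to the gated meta-llama/Llama-2-7b-hf repo; any valid Llama 1/2 tokenizer should work the same way.

```python
# Minimal sketch: load an alternative Llama 1/2 tokenizer directly to confirm
# it loads in reasonable time, mirroring the use_fast=False setting the script
# uses. Assumes access to the gated meta-llama/Llama-2-7b-hf repo.
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_fast=False,  # same slow-tokenizer setting as in the script
)
print(tokenizer.tokenize("Hello, world!"))
```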
Thank you! During the first run of prepare_train_data.sh, it failed with a traceback pointing to split_sharegpt_conversations.py, line 8.
Hi, I'm unsure, since I can't replicate this issue. I would recommend making sure your package versions match ours and/or using the Dockerfile we provide with the repository: https://github.com/allenai/open-instruct/blob/main/Dockerfile
Using this Docker image, I am able to run the data creation script fine, including the splitting part. Perhaps try running the splitting code separately first, and then run the rest of the script?
Additionally, adjusting the number of workers used in the processing might help: https://github.com/allenai/open-instruct/blob/main/scripts/split_sharegpt_conversations.py#L66 - sometimes lock contention can slow things down, especially if you have fewer than 128 cores. We usually run this script on an internal server with lots of resources, so I would suggest trying a smaller number of workers, as sketched below.
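As a rough illustration of the idea (not the repo's exact code at that line), capping the multiprocessing pool to the machine's actual core count, or something smaller, is the kind of change meant here:

```python
# Rough sketch: size the worker pool to the available CPUs instead of a fixed
# large number, to avoid lock contention on smaller machines. The cap of 16
# and the count_words workload are placeholders, not the repo's code.
import multiprocessing


def count_words(text):
    # Stand-in for the per-conversation splitting/tokenization work.
    return len(text.split())


if __name__ == "__main__":
    num_workers = min(16, multiprocessing.cpu_count())
    with multiprocessing.Pool(num_workers) as pool:
        print(pool.map(count_words, ["one two", "three four five"]))
```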
Thank you, your suggestion works fine!
In scripts/split_sharegpt_conversations.py, line 96, tokenizer = transformers.AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=False) takes forever to load, and it uses oobabooga/llama-tokenizer here. Any ideas? Thank you!