Hi, if oobabooga/llama-tokenizer is specifically taking a long time, you can swap it out with any valid Llama 1/2 tokenizer, e.g. https://huggingface.co/meta-llama/Llama-2-7b-hf
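As a minimal sketch (not the repo's code): you can first check that an alternative tokenizer loads quickly on its own before pointing the script's model_name_or_path argument at it. This assumes you have Hugging Face access to the gated meta-llama/Llama-2-7b-hf repo; any valid Llama 1/2 tokenizer should work the same way.

```python
# Minimal sketch: load an alternative Llama 1/2 tokenizer directly to confirm
# it loads in reasonable time, mirroring the use_fast=False setting the script
# uses. Assumes access to the gated meta-llama/Llama-2-7b-hf repo.
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_fast=False,  # same slow-tokenizer setting as in the script
)
print(tokenizer.tokenize("Hello, world!"))
```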
Thank you! During the first run of prepare_train_data.sh, it failed with a traceback pointing to split_sharegpt_conversations.py, line 8.
Hi, I'm unsure, since I can't replicate this issue. I would recommend making sure your package versions match ours and/or using the Dockerfile we provide with the repository: https://github.com/allenai/open-instruct/blob/main/Dockerfile
Using this Docker image, I am able to run the data creation script fine, including the splitting part. Perhaps try running the splitting code separately first, and then run the rest of the script?
Additionally, adjusting the number of workers used in the processing might help: https://github.com/allenai/open-instruct/blob/main/scripts/split_sharegpt_conversations.py#L66 - sometimes lock contention can slow things down, especially if you have fewer than 128 cores. We usually run this script on an internal server with lots of resources, so I would suggest trying a smaller number of workers, as sketched below.
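As a rough illustration of the idea (not the repo's exact code at that line), capping the multiprocessing pool to the machine's actual core count, or something smaller, is the kind of change meant here:

```python
# Rough sketch: size the worker pool to the available CPUs instead of a fixed
# large number, to avoid lock contention on smaller machines. The cap of 16
# and the count_words workload are placeholders, not the repo's code.
import multiprocessing


def count_words(text):
    # Stand-in for the per-conversation splitting/tokenization work.
    return len(text.split())


if __name__ == "__main__":
    num_workers = min(16, multiprocessing.cpu_count())
    with multiprocessing.Pool(num_workers) as pool:
        print(pool.map(count_words, ["one two", "three four five"]))
```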
Thank you, your suggestion works fine!
In scripts/split_sharegpt_conversations.py, line 96, tokenizer = transformers.AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=False) takes forever to load, and it uses oobabooga/llama-tokenizer here. Any ideas? Thank you!