Closed: pshivraj closed this issue 1 year ago.
PLEASE let me know if you figure it out. I've been struggling to quantize models too. I can't seem to figure out how to (1) quantize correctly and (2) actually run the models on ctranslate2. I have a script that runs ctranslate2 inference, but the model is still in its original format... and apparently ctranslate2 just does the inference at a lower precision without quantizing the model outright. This is something that could use better user instructions...
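For reference, here is my rough understanding of how the pieces are supposed to fit together, as a minimal sketch (assuming a model already converted with one of the ct2-transformers-converter commands quoted elsewhere in this thread; I haven't verified this end to end). The --quantization flag applies at conversion time, so the model.bin written to the output directory is already stored quantized, and the inference script just loads that directory:

import ctranslate2
import transformers

# Load the converted (and already quantized) model directory.
generator = ctranslate2.Generator("llama-2-13b-chat-ct2", device="cuda")

# Tokenize the prompt with the matching Hugging Face tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-hf")
prompt = "def fibonacci(n):"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# Generate, then decode the token strings back to text.
results = generator.generate_batch([tokens], max_length=128, sampling_temperature=0.1)
text = tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].sequences[0]))
print(text)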
I think the quantization works for the 13b codellama model using the above command. I am inferring this because the model.bin file in the llama-2-13b-chat-ct2 folder is around 13 GB, while the bin files for the original model add up to around 26 GB. However, I am struggling to do the same for the 34b version, since I am running into an incorrect vocabulary length. How does the bin file look for you?
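(Sanity-checking the arithmetic: 13B parameters at 1 byte each under int8 is roughly 13 GB, while the original float16 weights at 2 bytes per parameter are roughly 26 GB, so those file sizes are consistent with the weights actually being quantized.)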
Hi,
The associated tokenizer is returning more tokens than what is expected by the model.
It appears this issue will be fixed by https://huggingface.co/codellama/CodeLlama-34b-hf/discussions/9. I suggest adding a comment on that PR so it gets merged faster.
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
>>> len(tokenizer.get_vocab())
32004
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf", revision="refs/pr/9")
>>> len(tokenizer.get_vocab())
32000
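If you are curious which tokens account for the extra four entries, get_added_vocab() on the original revision should list them (I believe they are CodeLlama's infill special tokens, but check the output yourself):

>>> tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
>>> tokenizer.get_added_vocab()  # should map the four extra token strings to ids 32000-32003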
Here's the proposed change... I suppose someone could just edit the specific file on Hugging Face manually rather than waiting... lol
Actually you can also set the revision in the conversion command line. You can try the following to download the model from the pending PR:
ct2-transformers-converter --model codellama/CodeLlama-34b-hf --copy_files tokenizer.model --output_dir llama-2-34b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force --revision refs/pr/9
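If you prefer the Python API over the command line, I believe the equivalent looks roughly like this (a sketch based on my reading of the converter options; double-check the parameter names against your CTranslate2 version):

import ctranslate2.converters

converter = ctranslate2.converters.TransformersConverter(
    "codellama/CodeLlama-34b-hf",
    copy_files=["tokenizer.model"],
    low_cpu_mem_usage=True,
    revision="refs/pr/9",  # download the tokenizer fix from the pending PR
)
converter.convert("llama-2-34b-chat-ct2", quantization="int8_bfloat16", force=True)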
@guillaumekln Thanks for pointing me to the right resource. I was able to get the conversion going using the PR updates.
Hi,
I am trying to convert codellama-34b to int8_bfloat16 using the following command:
ct2-transformers-converter --model codellama/CodeLlama-34b-hf --copy_files tokenizer.model --output_dir llama-2-34b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force
I am running into an error about an incorrect vocabulary size:
ValueError: Vocabulary has size 32004 but the model expected a vocabulary of size 32000
I ran this for codellama-13b and it works fine.
ct2-transformers-converter --model codellama/CodeLlama-13b-hf --copy_files tokenizer.model --output_dir llama-2-13b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force
I tried looking into the inconsistent token count behind this ValueError at https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/specs/model_spec.py#L590
It appears that for codellama-13b I get expected_vocabulary_size = len(self._vocabulary) = 32016, which is why that conversion succeeds.
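For what it's worth, the mismatch also shows up in the model configs, which report the embedding size each checkpoint expects; the values line up with the numbers in the error and in model_spec.py:

>>> import transformers
>>> transformers.AutoConfig.from_pretrained("codellama/CodeLlama-34b-hf").vocab_size
32000
>>> transformers.AutoConfig.from_pretrained("codellama/CodeLlama-13b-hf").vocab_size
32016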