OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Codellama-34b conversion #1456

Closed pshivraj closed 1 year ago

pshivraj commented 1 year ago

Hi,

I am trying to convert codellama-34b to int8_bfloat16 using the following command:

ct2-transformers-converter --model codellama/CodeLlama-34b-hf --copy_files tokenizer.model --output_dir llama-2-34b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force

I am running into an error about an incorrect vocabulary size:

ValueError: Vocabulary has size 32004 but the model expected a vocabulary of size 32000

I ran this for codellama-13b and it works fine.

ct2-transformers-converter --model codellama/CodeLlama-13b-hf --copy_files tokenizer.model --output_dir llama-2-13b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force

I tried looking into the inconsistent token count for this ValueError at https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/specs/model_spec.py#L590

It appears that for codellama-13b, expected_vocabulary_size and the size of self._vocabulary both come out to 32016, so the check passes.
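
In case it helps anyone reproduce this outside the converter, the mismatch is visible directly from transformers (a quick sketch; it assumes config.vocab_size is the embedding size the checkpoint actually declares):

import transformers

model_id = "codellama/CodeLlama-34b-hf"

# Number of tokens the tokenizer actually produces...
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
print("tokenizer vocabulary:", len(tokenizer.get_vocab()))

# ...versus the vocabulary size the checkpoint declares.
config = transformers.AutoConfig.from_pretrained(model_id)
print("model vocab_size:", config.vocab_size)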

BBC-Esq commented 1 year ago

PLEASE let me know if you figure it out. I've been struggling to quantize models too. I can't seem to figure out how to (1) quantize correctly and (2) actually run the models on ctranslate2. I have a script that runs ctranslate2 to do inference, but the model is still in its original format... apparently ctranslate2 just does the inference at a lower precision without actually quantizing the model outright. This is something that could use better user instructions...
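
For reference, here's roughly what my inference script looks like (a rough sketch following the generation docs; it assumes a converted Llama directory like llama-2-13b-chat-ct2 from the command above, with tokenizer.model copied in, and I'm not sure it's the intended way):

import ctranslate2
import sentencepiece as spm

# Load the converted model and the SentencePiece tokenizer copied during conversion.
generator = ctranslate2.Generator("llama-2-13b-chat-ct2", device="cuda")  # or device="cpu"
sp = spm.SentencePieceProcessor(model_file="llama-2-13b-chat-ct2/tokenizer.model")

# Llama-style models expect the BOS token followed by the prompt pieces.
prompt = ["<s>"] + sp.encode("def fibonacci(n):", out_type=str)

results = generator.generate_batch([prompt], max_length=128, sampling_topk=10)
print(sp.decode(results[0].sequences_ids[0]))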

pshivraj commented 1 year ago

I think the quantization works for the 13b codellama model using the above command. I'm inferring this because the model.bin file in the llama-2-13b-chat-ct2 folder is around 13 GB, while the original model's bin files total around 26 GB. I am still struggling with the 34b version, though, since I keep running into the incorrect vocabulary size error. How does the bin file look for you?
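
The halving also matches the storage math (a back-of-the-envelope estimate that ignores any layers left unquantized):

# float16 stores 2 bytes per weight, int8 stores 1.
params = 13e9  # roughly 13 billion parameters for the 13b model

print(f"float16: ~{params * 2 / 1e9:.0f} GB")  # ~26 GB, matches the original bin files
print(f"int8:    ~{params * 1 / 1e9:.0f} GB")  # ~13 GB, matches model.bin after conversion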

guillaumekln commented 1 year ago

Hi,

The associated tokenizer returns more tokens than the model expects.

It appears this issue will be fixed by https://huggingface.co/codellama/CodeLlama-34b-hf/discussions/9. I suggest adding a comment on that PR so it gets merged faster.

>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
>>> len(tokenizer.get_vocab())
32004
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf", revision="refs/pr/9")
>>> len(tokenizer.get_vocab())
32000
BBC-Esq commented 1 year ago

Here's the proposed change... I suppose someone could just change the specific file on huggingface manually rather than waiting... lol
[screenshot of the proposed change from the Hugging Face PR]

guillaumekln commented 1 year ago

Actually you can also set the revision in the conversion command line. You can try the following to download the model from the pending PR:

ct2-transformers-converter --model codellama/CodeLlama-34b-hf --copy_files tokenizer.model --output_dir llama-2-34b-chat-ct2 --quantization int8_bfloat16 --low_cpu_mem_usage --force --revision refs/pr/9
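
If you prefer the Python API, the equivalent call should look roughly like this (a sketch; it assumes TransformersConverter accepts the same revision option as the --revision flag):

import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "codellama/CodeLlama-34b-hf",
    copy_files=["tokenizer.model"],
    low_cpu_mem_usage=True,
    revision="refs/pr/9",  # download the files from the pending PR
)
converter.convert("llama-2-34b-chat-ct2", quantization="int8_bfloat16", force=True)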

pshivraj commented 1 year ago

@guillaumekln Thanks for pointing me to the right resource. I was able to get the conversion working using the PR updates.