I could fix the problem by setting n_positions=4096 before compiling the model:
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, n_positions=4096, amp='f16')
Maybe this flag should be added to the tutorial to avoid other people having to deal with the same problem :)
Thanks @dennj for pointing out the issues. Yes, you can increase the number of positions by setting the n_positions variable. For example, to support up to 4k positions, you could do the following:
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, amp='f16', n_positions=4096)
We'll update the documentation accordingly.
Thanks :) Updating the documentation will be helpful.
However, the behaviour is non-deterministic. Is it ok for the library to have random behaviour?
By default the tutorial uses top_k=50 sampling, which performs multinomial sampling. If you would like to do deterministic sampling, you can set top_k=1 in your sampling call. This will perform deterministic greedy sampling.
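For illustration, a minimal sketch of that change using a tutorial-style sample() call (the tokenizer setup, prompt, and sequence_length here are assumptions, not taken from this thread):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-chat-hf')
input_ids = tokenizer("Explain what AWS Neuron is.", return_tensors="pt").input_ids

with torch.inference_mode():
    # top_k=1 means greedy decoding: repeated runs on the same prompt
    # should produce identical output.
    generated = neuron_model.sample(input_ids, sequence_length=4096, top_k=1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```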
Hi dennj - okay to close this ticket or do you have any other questions?
Closing since there were no further comments
I am trying to use meta-llama/Llama-2-13b-chat-hf, which has a max_position_embeddings of 4096 tokens. I found that the library fails in a non-deterministic way when the input length is between 1790 and 1800 tokens. If you submit exactly the same prompt several times, you randomly get either a good output or a failure, while above 1800 tokens the failure becomes more deterministic. However, LLaMA with the Hugging Face transformers library works fine with more than 2000 tokens. Here is a piece of code to reproduce the error.

Model preparation:
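The original snippet was not preserved in this extract; below is a rough sketch of the model-preparation step following the usual transformers-neuronx Llama-2 flow, reconstructed from the parameters mentioned in this thread (the split directory and tp_degree value are assumptions):

```python
from transformers import LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

# Download the Hugging Face checkpoint and save it in the split format
# expected by transformers-neuronx.
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-13b-chat-hf')
save_pretrained_split(model, './Llama-2-13b-split')

# Load and compile for Neuron with the tutorial defaults, i.e. without
# n_positions; adding n_positions=4096 here is the fix discussed above.
xla_device_count = 24  # assumption: number of available NeuronCores
neuron_model = LlamaForSampling.from_pretrained(
    './Llama-2-13b-split',
    batch_size=1,
    tp_degree=xla_device_count,
    amp='f16',
)
neuron_model.to_neuron()
```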
Reproduce the bug:
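The reproduction code was likewise not preserved; here is a hedged sketch of the kind of loop described, repeatedly sampling the same ~1790-1800-token prompt (the filler prompt, token count, and loop length are assumptions for illustration):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-chat-hf')

# Build a prompt and trim it to roughly 1795 tokens, the range where the
# non-deterministic failures were observed.
prompt = "Tell me a long story about a dragon. " * 300  # assumption: filler text
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :1795]
print("prompt length:", input_ids.shape[1])

# Feed exactly the same input several times; some iterations succeed and
# others fail or produce garbage output.
for i in range(10):
    with torch.inference_mode():
        generated = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    completion = tokenizer.decode(generated[0][input_ids.shape[1]:], skip_special_tokens=True)
    print(f"iteration {i}: {completion[:80]!r}")
```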
As I said, the bug is not deterministic, so the code will fail at a different iteration every time. Here is an example: