dacorvo opened 4 months ago
Note that the NaN issue when masking does not happen if the inputs are small enough (fewer than 15 tokens). It does not happen with `gpt2` either.
Just to give an idea of the consequences for inference, consider the input prompt "One of my fondest memory is of my grandmother making homemade bread".
Inference result with the same prompt twice in the same batch, hence no padding:
'<s> One of my fondest memory is of my grandmother making homemade bread. It was a special occasion, like a birthday or holiday',
'<s> One of my fondest memory is of my grandmother making homemade bread. It was a special occasion, like a birthday or holiday'
Inference result with one of the prompts slightly longer (a `.` appended), hence with a 1-token masked padding:
'</s><s> One of my fondest memory is of my grandmother making homemade bread<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>',
'<s> One of my fondest memory is of my grandmother making homemade bread.<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
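For context, a minimal sketch of how such a left-padded batch and its mask could be built (the tokenizer settings and `<llama-path>` are assumptions, not the exact script from this issue):

```python
import torch
from transformers import AutoTokenizer

# Hypothetical checkpoint path; any Llama tokenizer illustrates the point.
tokenizer = AutoTokenizer.from_pretrained("<llama-path>")
tokenizer.pad_token = tokenizer.eos_token  # Llama defines no pad token by default
tokenizer.padding_side = "left"            # pad on the left, as in the outputs above

prompts = [
    "One of my fondest memory is of my grandmother making homemade bread",
    "One of my fondest memory is of my grandmother making homemade bread.",
]
encoded = tokenizer(prompts, return_tensors="pt", padding=True)

# The shorter prompt gets one eos pad on the left, hence the '</s><s>' prefix
# in its decoded output. start_ids marks the first non-padding position.
start_ids = (encoded.attention_mask == 0).sum(dim=1).to(torch.int32)
print(start_ids)  # tensor([1, 0], dtype=torch.int32)
```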
If you run the same padded batch but omit the mask, the results become non-deterministic:
'</s><s> One of my fondest memory is of my grandmother making homemade bread. She would mix the dough by hand, kneading it',
'<s> One of my fondest memory is of my grandmother making homemade bread. It was a special occasion, like a birthday or holiday'
If you do the same thing with even more padding, then the outputs are complete gibberish:
'</s></s></s></s></s></s></s></s></s></s></s><s> One of my fondest memory is of my grandmother making homemade breadMSMSMS',
'<s> One of my fondest memory is of my grandmother making homemade bread. It was a special occasion, like a birthday or holiday'
Thank you for reporting the issue, we are trying to reproduce on our end. Just to confirm, you are using transformers_neuronx from the 2.16 release?
Yes.
We were able to reproduce the NaNs using the command `python test_padded_inputs.py <llama-path> --input-length 64 --mask-inputs`. We are now looking into it.
Bump: more urgent now that continuous batching is also broken for Mistral and Mixtral.
In previous versions of `transformers_neuronx`, one could use `start_ids` to mask inputs during the inference of Llama models. Now, specifying anything other than `None` or `0` when calling the model `forward()` method returns NaN scores.
Here is an example script to illustrate the issue:
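A minimal sketch of such a script, assuming the `transformers_neuronx` `LlamaForSampling` API (`tp_degree`, `amp`, and the sequence length are placeholder values, not the original script's settings):

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# <llama-path> is a placeholder for a local Llama checkpoint directory.
model = LlamaForSampling.from_pretrained("<llama-path>", batch_size=2, tp_degree=2, amp="f16")
model.to_neuron()

tokenizer = AutoTokenizer.from_pretrained("<llama-path>")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "One of my fondest memory is of my grandmother making homemade bread",
    "One of my fondest memory is of my grandmother making homemade bread.",
]
encoded = tokenizer(prompts, return_tensors="pt", padding=True)
input_ids = encoded.input_ids
# Non-zero start_ids (masking the left padding) is what triggers the NaNs.
start_ids = (encoded.attention_mask == 0).sum(dim=1).to(torch.int32)

with torch.inference_mode():
    output_ids = model.sample(input_ids, sequence_length=128, start_ids=start_ids)
print(tokenizer.batch_decode(output_ids))
```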
Assuming you have saved a Llama checkpoint under `<llama-path>`, you will get the following results:
Is this expected? I noticed that the new continuous batching feature interprets `start_ids` as `seq_ids`: should `start_ids` be used only to differentiate between active/inactive sequences from now on, and not to mask inputs? If so, how are we supposed to deal with inputs of different lengths in the same batch? Without masking, the outputs are complete gibberish.