dacorvo opened this issue 7 months ago
Thank you for filing the issue. We have found a fix for the problem and it will be available in an upcoming release.
Currently, continuous batching support has only been officially released for Llama: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#overview-of-continuous-batching-api-and-vllm-support
Mistral/Mixtral are planned for future releases. We will update this ticket when we have released official support for the Mixtral model.
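For context, the linked developer guide drives Llama continuous batching through vLLM's Neuron backend. Below is a minimal sketch of that flow, not an excerpt from the guide: the model path, sequence counts, and tensor-parallel degree are illustrative, and the exact arguments should be checked against the guide for your SDK version.

```python
# Sketch of continuous batching via the vLLM Neuron backend, following the
# linked guide. Values below (model path, batch sizes, TP degree) are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # Llama is the officially supported model
    device="neuron",                   # select the transformers-neuronx backend
    tensor_parallel_size=2,            # number of NeuronCores to shard across
    max_num_seqs=4,                    # maximum number of concurrently batched sequences
    max_model_len=128,
    block_size=128,
)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, SamplingParams(top_k=1, max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```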
Then Mistral and Mixtral are actually not supported at all, because static batching with padding (the alternative to continuous batching) has been broken for all models since the introduction of continuous batching: https://github.com/aws-neuron/transformers-neuronx/issues/79. Or has it been fixed?
The 2.19 release went out this week. With this new release, we have added support for Mistral. Support for Mixtral will be added in an upcoming release.
Which AWS Neuron image should I roll back to in order to correctly run Mixtral?
In the latest AWS Neuron SDK 2.18.1 release, the `transformers-neuronx` package has been updated to a new version, 0.10.0.360, whose code is not available in this repository at the moment. One of the changes is to 'fix' continuous batching, but it actually breaks the Mixtral model.

The symptom is that the first call to `forward` after encoding fails with:

The root cause is a modification in the `base.py` file, method `_prepare_for_par_ctx_rhs_padding`, line 265. The `last_token_id` returned value used to be a scalar, but can now be a vector. This leads to `self.num_processed_tokens` also becoming a vector, which causes the error.
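For illustration, here is a minimal sketch of the failure mode, not the actual `transformers-neuronx` code: the `TokenCounter` class and its `update` method are hypothetical stand-ins for the accumulation of `last_token_id` into `self.num_processed_tokens` described above.

```python
import torch

class TokenCounter:
    """Hypothetical stand-in for the bookkeeping around num_processed_tokens."""

    def __init__(self):
        self.num_processed_tokens = torch.tensor(0)  # expected to stay a scalar

    def update(self, last_token_id):
        # If last_token_id is a 0-dim (scalar) tensor, the sum stays 0-dim.
        # If it is a vector (one entry per batched sequence, as after the
        # 0.10.0.360 change), broadcasting silently turns num_processed_tokens
        # into a vector, and downstream code that assumes a scalar fails on
        # the next forward call.
        self.num_processed_tokens = self.num_processed_tokens + last_token_id

scalar_case = TokenCounter()
scalar_case.update(torch.tensor(7))
print(scalar_case.num_processed_tokens.shape)  # torch.Size([]) -- still a scalar

vector_case = TokenCounter()
vector_case.update(torch.tensor([7, 3]))
print(vector_case.num_processed_tokens.shape)  # torch.Size([2]) -- now a vector
```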