dacorvo opened this issue 7 months ago
Thank you for filing the issue. We have found a fix for the problem and it will be available in an upcoming release.
Currently, continuous batching support has only been officially released for Llama: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#overview-of-continuous-batching-api-and-vllm-support
Mistral/Mixtral are planned for future releases. We will update this ticket when we have released official support for the Mixtral model.
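For context, the linked developer guide drives Llama continuous batching through vLLM's Neuron backend. Below is a minimal sketch of that flow, not an excerpt from the guide: the model path, sequence counts, and tensor-parallel degree are illustrative, and the exact arguments should be checked against the guide for your SDK version.

```python
# Sketch of continuous batching via the vLLM Neuron backend, following the
# linked guide. Values below (model path, batch sizes, TP degree) are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # Llama is the officially supported model
    device="neuron",                   # select the transformers-neuronx backend
    tensor_parallel_size=2,            # number of NeuronCores to shard across
    max_num_seqs=4,                    # maximum number of concurrently batched sequences
    max_model_len=128,
    block_size=128,
)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, SamplingParams(top_k=1, max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```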
Then Mistral and Mixtral are actually not supported at all, because static batching with padding (the alternative to continuous batching) has been broken for all models since the introduction of continuous batching: https://github.com/aws-neuron/transformers-neuronx/issues/79. Or has it been fixed?
The 2.19 release went out this week. With this new release, we have added support for Mistral. Support for Mixtral will be added in an upcoming release.
Which AWS Neuron image should I roll back to in order to correctly run Mixtral?
In the latest AWS Neuron SDK 2.18.1 release, the `transformers-neuronx` package has been updated to a new version, 0.10.0.360, whose code is not available in this repository at the moment. One of the changes is to 'fix' continuous batching, but it actually breaks the Mixtral model.

The symptom is that the first call to `forward` after encoding fails with:

The root cause is a modification in the `base.py` file, method `_prepare_for_par_ctx_rhs_padding`, line 265. The `last_token_id` returned value used to be a scalar, but can now be a vector. This leads to `self.num_processed_tokens` also becoming a vector, which causes the error.
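For illustration, here is a minimal sketch of the failure mode, not the actual `transformers-neuronx` code: the `TokenCounter` class and its `update` method are hypothetical stand-ins for the accumulation of `last_token_id` into `self.num_processed_tokens` described above.

```python
import torch

class TokenCounter:
    """Hypothetical stand-in for the bookkeeping around num_processed_tokens."""

    def __init__(self):
        self.num_processed_tokens = torch.tensor(0)  # expected to stay a scalar

    def update(self, last_token_id):
        # If last_token_id is a 0-dim (scalar) tensor, the sum stays 0-dim.
        # If it is a vector (one entry per batched sequence, as after the
        # 0.10.0.360 change), broadcasting silently turns num_processed_tokens
        # into a vector, and downstream code that assumes a scalar fails on
        # the next forward call.
        self.num_processed_tokens = self.num_processed_tokens + last_token_id

scalar_case = TokenCounter()
scalar_case.update(torch.tensor(7))
print(scalar_case.num_processed_tokens.shape)  # torch.Size([]) -- still a scalar

vector_case = TokenCounter()
vector_case.update(torch.tensor([7, 3]))
print(vector_case.num_processed_tokens.shape)  # torch.Size([2]) -- now a vector
```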