I'm trying to compile Bart for text2text generation on an Inf2 server. I am aware that optimum-neuron has a Bart implementation, but I need to be able to make customizations that are incompatible with the pipeline system.
Bart is implemented so that if you pass past_key_values, you only need to provide the last decoder input ID rather than the full sequence. This makes attention linear rather than quadratic per decoding step, because it only has to run for one position instead of every position generated so far. This is an important compute optimisation.
When I try to trace a call to Bart that uses this optimisation, I get an error:
2023-07-04T12:38:29Z ERROR 26864 [Tensorizer]: Transformation error on operator: mlir.function
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: An Internal Compiler Error has occurred
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: ***************************************************************
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]:
2023-07-04T12:38:29Z ERROR 26864 [neuronx-cc]: Error message: too many values to unpack (expected 1)
Steps to reproduce:
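The cached single-token decoding call being traced looks roughly like the following. This is a minimal illustrative sketch, not my exact reproduction: the tiny random-weight config, tensor shapes, and token values are placeholders I chose so the snippet is self-contained.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# Tiny random-weight model so the sketch runs anywhere; the real
# case uses a pretrained checkpoint.
config = BartConfig(
    vocab_size=128, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
model = BartForConditionalGeneration(config).eval()

input_ids = torch.tensor([[0, 5, 6, 7, 2]])   # encoder input
decoder_prefix = torch.tensor([[2, 5, 6]])    # decoder ids generated so far

# Step 1: run the decoder over the full prefix and keep the KV cache.
with torch.no_grad():
    out = model(input_ids=input_ids,
                decoder_input_ids=decoder_prefix,
                use_cache=True)
past = out.past_key_values

# Step 2: with the cache, only the last decoder token is passed in,
# and logits come back for just that one position.
last_token = decoder_prefix[:, -1:]
with torch.no_grad():
    step = model(input_ids=input_ids,
                 decoder_input_ids=last_token,
                 past_key_values=past,
                 use_cache=True)
print(step.logits.shape)  # one position of logits, not the whole prefix

# Tracing a wrapper around this cached single-token step is what
# produces the internal compiler error (requires torch_neuronx on an
# Inf2 host), along the lines of:
#
#   import torch_neuronx
#   traced = torch_neuronx.trace(wrapper_module, example_inputs)
```

The eager calls above work as expected; it is only the trace/compile of the past_key_values path that fails.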
Let me know if you have trouble reproducing it and need additional details.
Many thanks.