Right now the T5 notebook is based on the large flavor. There were 2 blocking issues with the 3B flavor. The first, an ORT bug, has been fixed by https://github.com/microsoft/onnxruntime/pull/11650 (context: https://github.com/microsoft/onnxruntime/issues/11511).
The only remaining blocker is the memory footprint. The cause is that we load the decoder weights twice during the conversion (once with cache support, once without). We can avoid that by using the If trick during the conversion, as we already do after it.