This PR introduces a custom inference session class that improves how ONNX inference is handled within TransformersPHP. Previously, the original inference session from ankane/onnxruntime-php processed inputs and outputs as plain PHP arrays. Since TransformersPHP primarily works with tensors, this created extra conversion steps.
Here's how the original process worked (a rough PHP sketch follows the list):

1. Convert the input tensor to a standard PHP array.
2. Convert the PHP array to a C array for the ONNX session.
3. The model processes the input and returns a C array.
4. Convert the C array back to a multidimensional PHP array.
5. Convert the PHP array back to a tensor.
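A minimal sketch of that round trip, assuming the `InferenceSession::run()` API from ankane/onnxruntime-php; the `Tensor` namespace and its `fromArray()`/`toArray()` helpers here are illustrative, not the exact TransformersPHP API:

```php
<?php

use Codewithkyrian\Transformers\Tensor\Tensor; // illustrative namespace
use OnnxRuntime\InferenceSession;

$session = new InferenceSession('model.onnx');
$inputTensor = Tensor::fromArray([[101, 2023, 102]]); // e.g. token ids

// Step 1: Tensor -> nested PHP array (full copy of the data)
$inputArray = $inputTensor->toArray();

// Steps 2-4: run() copies the PHP array into a C array, executes the
// model, and converts the resulting C array back into a nested PHP array.
$outputs = $session->run(null, ['input_ids' => $inputArray]);

// Step 5: nested PHP array -> Tensor (yet another full copy)
$logits = Tensor::fromArray($outputs[0]);
```

Each numbered step is a full traversal and copy of the data, which is where the memory and time overhead comes from.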
With the new custom inference session, these unnecessary conversions and overheads are eliminated, conserving memory and improving performance. Now, the session accepts a tensor as input and returns a tensor as output, streamlining the process to (sketched below):

1. Convert the input tensor to a C array for the ONNX session.
2. The model processes the input and returns a C array.
3. Convert the C array back to a tensor.
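A minimal sketch of the new flow, assuming a hypothetical `CustomInferenceSession` class whose `run()` accepts and returns tensors (the actual class and method names in this PR may differ):

```php
<?php

use Codewithkyrian\Transformers\Tensor\Tensor; // illustrative namespace

// Hypothetical tensor-aware session introduced by this PR.
$session = new CustomInferenceSession('model.onnx');

$inputTensor = Tensor::fromArray([[101, 2023, 102]]);

// Steps 1-2: the tensor's buffer is copied straight into a C array and
// the model runs. Step 3: the output C array is wrapped into a Tensor,
// skipping the intermediate PHP arrays entirely.
$outputs = $session->run(['input_ids' => $inputTensor]);
$logits = $outputs['logits']; // already a Tensor
```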
Additionally, the custom inference session resolves an issue with zero-sized tensors. Once a zero-sized tensor is flattened into a PHP array, its shape information is lost: every empty shape collapses to the same empty array. By working directly with tensors, the new inference session retains shape information even for zero-sized tensors. This allows for accurate memory allocation and shape management, eliminating the need for manual adjustments for attention masks in decoder models.
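To make the zero-size problem concrete, here is an illustrative example; `Tensor::zeros()` and `shape()` are assumed helper names, not necessarily the exact TransformersPHP API:

```php
<?php

use Codewithkyrian\Transformers\Tensor\Tensor; // illustrative namespace

// A zero-sized tensor: no elements, but a meaningful shape.
$empty = Tensor::zeros([1, 0, 768]); // Tensor::zeros() is an assumed helper

// Flattened to a PHP array, there is nothing left to recover the shape
// from: shapes like [1, 0, 768] and [4, 0, 2] both collapse to the same
// empty structure.
$flattened = []; // what the old flat-array path effectively sees

// Keeping the Tensor end-to-end preserves the shape, so the session can
// still describe the (empty) buffer to ONNX Runtime correctly.
assert($empty->shape() === [1, 0, 768]);
```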
In summary, this PR optimizes the handling of inputs and outputs in ONNX inference sessions, reducing conversion overhead and improving memory usage and performance, especially for larger model inputs and outputs.