Open candlewill opened 4 months ago
@candlewill thanks for looking into this; I think this would be a good change to make, though I wasn't able to get it to work by naively applying it. If you have a working branch with this change applied would you be willing to make a pull request?
@candlewill Hey :) Would you mind uploading the required changes as a patch file or sharing a forked git version with your optimization? I would be curious to try it out, and I'm having trouble applying the second part of your patch in the main inference loop.
I'm also interested in this. Increasing the overall performance on a single-threaded CPU (with AVX-512) would be awesome, if possible. It would also help to identify which parts are loading (and can run before the message is known) and which parts can only run after the message is known.
Thank you for this excellent implementation. I'd like to suggest an optimization that could significantly speed up inference and enable streaming output.
Currently, there are two GPT2 graphs:
Since CLVP has been removed, we can streamline this to a single GPT2 graph that directly generates latents. I've implemented this with minimal changes:
In `autoregressive_graph`, add after `cur = ggml_add(ctx0, cur, model.language_model_head_layer_norm_bias);`:
Benefits:
This optimization could significantly benefit users looking to speed up inference or implement streaming latent generation.
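To make the streaming idea concrete, here is a minimal, self-contained sketch of the control flow only. It is not the patch described above and uses no ggml calls: `toy_forward`, `StepResult`, `kStopToken`, `kLatentDim`, and `generate_streaming` are all hypothetical names, and the "model" is a toy stand-in. The assumption it illustrates is that a single forward pass per step can return both the next token and that position's latent, so latents can be handed to a consumer as they are produced instead of being recomputed by a second graph after sampling finishes.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical result of one autoregressive step: a single forward pass
// yields both the sampled token and the latent for that position.
struct StepResult {
    int next_token;
    std::vector<float> latent;
};

constexpr int kStopToken = 0;  // hypothetical stop-token id
constexpr int kLatentDim = 4;  // toy latent width

// Toy stand-in for the model: counts the last token down toward the stop
// token and fills the latent with the last token's value. A real
// implementation would evaluate the GPT2 graph here instead.
StepResult toy_forward(const std::vector<int>& tokens) {
    assert(!tokens.empty());
    const int last = tokens.back();
    StepResult r;
    r.next_token = last > 0 ? last - 1 : kStopToken;
    r.latent.assign(kLatentDim, static_cast<float>(last));
    return r;
}

using LatentCallback = std::function<void(int step, const std::vector<float>& latent)>;

// Runs the loop with one forward pass per step and streams each latent to
// the callback immediately. Returns the number of latents emitted.
int generate_streaming(int start_token, int max_steps, const LatentCallback& on_latent) {
    std::vector<int> tokens{start_token};
    int steps = 0;
    while (steps < max_steps) {
        const StepResult r = toy_forward(tokens);
        on_latent(steps, r.latent);  // the consumer sees the latent right away
        ++steps;
        if (r.next_token == kStopToken) {
            break;  // stop once the model predicts the stop token
        }
        tokens.push_back(r.next_token);
    }
    return steps;
}
```

In this sketch a caller passes a callback that forwards each latent to whatever stage consumes latents next, for example `generate_streaming(3, 16, callback)`. Whether the real graph can expose the latent at the same point where it computes logits is exactly what the proposed change would need to confirm.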