do-me / SemanticFinder

SemanticFinder - frontend-only live semantic search with transformers.js
https://do-me.github.io/SemanticFinder/
MIT License

(Eventual) GPU Acceleration #11

Open varunneal opened 1 year ago

varunneal commented 1 year ago

GPU acceleration of transformers.js is possible, but it is hacky.

It requires an unmerged PR branch of transformers.js that relies on a patched version of onnxruntime-node.

Xenova plans on merging this PR only after onnxruntime has official support for GPU acceleration. In the meantime, this change could be implemented, potentially as an advanced "experimental" feature.

do-me commented 1 year ago

This is really cool and paves the way for LLMs running in the browser!

I've had this idea in my head for a while now: we already have a kind of (primitive) vector DB (just a JSON) and the small model for embeddings. If we added an LLM for Q&A/text generation based solely on the information in the text, this would be huge!
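A minimal sketch of what the "primitive vector DB" part could look like, assuming a transformers.js feature-extraction pipeline and a plain JSON array as the store (model name and structure are illustrative, not necessarily what SemanticFinder uses); the top-k results could then be fed into an LLM prompt for Q&A:

```js
import { pipeline } from '@xenova/transformers';

// Small embedding model; swap in whatever model the app actually uses.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Our "vector DB": just an array of { text, embedding } objects.
const db = [];

async function embed(text) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

async function addToDb(text) {
  db.push({ text, embedding: await embed(text) });
}

function cosine(a, b) {
  // Embeddings are normalized, so the dot product equals cosine similarity.
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

async function search(query, topK = 3) {
  const q = await embed(query);
  return db
    .map(entry => ({ text: entry.text, score: cosine(q, entry.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```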

I already asked the folks from Qdrant on their Discord server whether they'd be interested in providing a JS/WebAssembly version of their Rust-based vector DB (as they have developed plenty of optimizations), but for the moment they have other priorities. Still, they said they might go for it at some point.

Anyway, I think this would make for an interesting POC. As for integrating it directly before it's officially supported: could we maybe detect WebGPU support automatically and simply load the right version? Or does the WebGPU build also support CPU?
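A hedged sketch of such auto-detection (the module paths are placeholders, not real files in this repo):

```js
// Check whether the browser exposes WebGPU and can actually provide an
// adapter before loading the GPU-enabled build.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

const useGPU = await hasWebGPU();
// Load whichever build applies; paths below are illustrative only.
const transformers = useGPU
  ? await import('./transformers-webgpu.js')
  : await import('./transformers-cpu.js');
```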

P.S. There would be so much fun in NLP with LLMs if, for example, we created an image of all leitmotifs in the text, or some kind of text-summary image, for a visual understanding of the text...

lizozom commented 1 year ago

I am working on a similar effort myself, let's cooperate!

More specifically, I wanted to use this project as a basis for an SDK that allows one to run semantic search on their own website's content.
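For illustration, one rough way such an SDK might collect a page's visible text for indexing (selectors and chunk size here are just assumptions):

```js
// Gather text from common content elements and split it into small chunks
// that an embedding model can handle.
function collectPageChunks(maxChars = 500) {
  const blocks = document.querySelectorAll('p, li, h1, h2, h3, td');
  const chunks = [];
  for (const el of blocks) {
    const text = el.innerText.trim();
    if (!text) continue;
    // Split long blocks so each chunk stays small enough to embed.
    for (let i = 0; i < text.length; i += maxChars) {
      chunks.push(text.slice(i, i + maxChars));
    }
  }
  return chunks;
}
```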

do-me commented 1 year ago

Sounds great! 
It's also on the feature/idea list in the readme.md that this repo could become a browser plugin for Firefox or Chrome. Of course, it would need a leaner GUI.

I was thinking of some kind of search bar integrated on top of a webpage, like Algolia/lunr etc. do. A good example is the mkdocs Material homepage:

(two screenshots of the search bar on the mkdocs Material site)

(By the way, I also had ideas for integrating semantic search in mkdocs, but I'm lacking the time atm...)

What about your idea?

(We're kind of drifting away from this issue's topic, let's move to discussions: https://github.com/do-me/SemanticFinder/discussions/15)

do-me commented 3 months ago

We're finally getting closer to WebGPU support: https://github.com/xenova/transformers.js/pull/545. It's already usable in the dev branch. I'm really excited about this, as people are reporting speedups of 20x-100x!

In my case (M3 Max) I'm getting a massive inferencing speedup of 32x-46x. See for yourself: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark

Even with cheap integrated graphics (Intel iGPUs like UHD or Iris Xe) I get a 4x-8x boost. So literally everyone would see massive speed gains!

This is the most notable performance improvement I see atm, hence referencing #49.

I hope that transformers.js will allow for some kind of automatic setting where WebGPU is used if available but otherwise falls back to plain CPU.
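A sketch of what that automatic setting might look like, assuming the `device` option that the v3/WebGPU work introduces (exact package and option names may differ in the dev branch):

```js
import { pipeline } from '@xenova/transformers';

// Prefer WebGPU when the browser exposes it, otherwise stay on the WASM/CPU backend.
const device = ('gpu' in navigator) ? 'webgpu' : 'wasm';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device,
});
```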

varunneal commented 3 months ago

Speedup is about 10x for me on an M1. Definitely huge. Not sure how embeddings will compare to inference in terms of GPU optimization but I think there is huge room for parallelization.