Open 0hq opened 1 year ago
I've been implementing and using pretty much the same ideas you're thinking of in a series of TensorFlow and Java projects.
Of course, I did the exact same thing with PyTorch, and also worked through top-k retrieval, batching, dynamic batching, etc.
If you take a look at my code and agree with the direction I think the implementation should go, I'll contribute to this repository.
Let's start GPU-accelerating with a PyTorch index. Dot product and cosine similarity are both essentially a matrix multiplication, so hardware accelerators are a natural fit here. With 32 GB of VRAM, we could fit about 22 million MiniLM embeddings on a single GPU (384 dimensions at f32 precision: 22M × 384 × 4 bytes ≈ 31.5 GiB).
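To make the idea concrete, here's a minimal sketch of what such a PyTorch index lookup could look like. The function name `topk_cosine` and the toy data are my own illustration, not code from any repository: embeddings are row-normalized once, so a single matrix-vector product yields cosine similarities, and `torch.topk` picks the best matches.

```python
import torch


def topk_cosine(index: torch.Tensor, query: torch.Tensor, k: int = 5):
    """Return the top-k cosine similarities between `query` and rows of `index`."""
    # L2-normalize so the dot product equals cosine similarity.
    index_n = torch.nn.functional.normalize(index, dim=1)
    query_n = torch.nn.functional.normalize(query, dim=0)
    # One matrix-vector product scores every stored embedding at once;
    # on a GPU this is exactly the matmul the accelerator is built for.
    scores = index_n @ query_n
    return torch.topk(scores, k)


# Toy example: 10k random MiniLM-sized (384-dim) f32 embeddings.
# Move both tensors to .cuda() when a GPU is available.
embeddings = torch.randn(10_000, 384)
query = torch.randn(384)
values, indices = topk_cosine(embeddings, query, k=5)
```

Batching follows the same shape logic: stack queries into a `(b, 384)` matrix and the matrix-vector product becomes a matrix-matrix product, with `torch.topk(..., dim=1)` per row.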