enjalot / latent-scope

A scientific instrument for investigating latent spaces

Support token level embeddings #64

Open enjalot opened 4 days ago

enjalot commented 4 days ago

Our current approach embeds datasets using Sentence Transformers, which give us one embedding per "chunk" of text (so whether we pass in 500 tokens or 100 tokens, we always get 1 embedding). Sentence Transformers "pool" the token embeddings into a single vector, usually by averaging them or by taking just the first or last token's embedding.
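For reference, a rough sketch of what that pooling step looks like when done by hand with `transformers` (the checkpoint name here is just an illustrative example, not what latent-scope necessarily ships with):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["a short chunk", "a much longer chunk of text with many more tokens in it"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean pooling: mask out padding, then average over the token axis so every
# chunk collapses to a single vector regardless of how many tokens it had.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)

print(token_embeddings.shape)  # e.g. torch.Size([2, 14, 384])
print(pooled.shape)            # torch.Size([2, 384])
```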

There is another technique gaining popularity, ColBERT, that instead gives you an embedding for each token. A recent model is jina-colbert-v2.

One could also imagine just getting back the hidden states from something like Llama-3.1-8B and working with those token-level embeddings.
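A hedged sketch of pulling token-level hidden states out of a causal LM with `transformers`. Llama-3.1-8B is gated and heavy, so a small model is used here as a stand-in (any checkpoint with the same architecture family would do); the shapes are the point, not the specific model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-135M"  # stand-in; swap for meta-llama/Llama-3.1-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# One embedding per token, per layer. For Llama-3.1-8B the hidden size is 4096,
# so a 500-token chunk yields a 500 x 4096 matrix instead of a single vector.
last_layer = out.hidden_states[-1]  # (1, num_tokens, hidden_dim)
print(len(out.hidden_states), last_layer.shape)
```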

When you don't pool you don't throw away a bunch of information, but of course this explodes the file size of the stored embeddings (a quick back-of-the-envelope below). It may still be worth it, and there are some things we could do to support token-level embeddings.
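All numbers here are illustrative assumptions (1M chunks, 500 tokens each, 4096-dim fp16 embeddings), just to show the scale of the blow-up:

```python
n_chunks = 1_000_000
tokens_per_chunk = 500
dim = 4096            # Llama-3.1-8B hidden size
bytes_per_value = 2   # fp16

pooled = n_chunks * dim * bytes_per_value
per_token = n_chunks * tokens_per_chunk * dim * bytes_per_value

print(f"pooled:    {pooled / 1e9:.1f} GB")     # ~8.2 GB
print(f"per-token: {per_token / 1e9:.1f} GB")  # ~4096 GB, i.e. ~500x larger
```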

One thing to try would be using RAGatouille to handle the nearest neighbor search.
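A minimal sketch of what that could look like, following the pattern in RAGatouille's docs. `colbert-ir/colbertv2.0` is the checkpoint their examples use; whether `jinaai/jina-colbert-v2` loads the same way is an assumption I haven't tested:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index the chunks; RAGatouille builds a compressed multi-vector index on disk.
chunks = ["first chunk of text ...", "second chunk of text ..."]
RAG.index(collection=chunks, index_name="latent-scope-demo")

# Late-interaction search: each query token is matched against document tokens.
results = RAG.search(query="example query", k=5)
for r in results:
    print(r["rank"], r["score"], r["content"][:60])
```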

Another thing to try is to store the SAE top features of the tokens rather than their full embedding vectors. Theoretically, if an SAE is "good" it will reconstruct the embedding pretty well, so we could cut 4096-dimensional Llama embeddings down to e.g. 128 stored values per token (64 indices and 64 activations for a top-64 SAE).
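A sketch of that storage format. The SAE weights below are random placeholders standing in for a trained top-64 SAE, and the dictionary size is an arbitrary assumption; only the encode-then-keep-top-k step and the resulting record size are the point:

```python
import torch

d_model, d_sae, k = 4096, 65536, 64
W_enc = torch.randn(d_model, d_sae) / d_model**0.5  # placeholder for trained weights
b_enc = torch.zeros(d_sae)

token_embedding = torch.randn(d_model)  # one token's hidden state from the LM

# Encode and keep only the k largest activations; a top-k SAE is trained to
# reconstruct the input from exactly these k features.
acts = torch.relu(token_embedding @ W_enc + b_enc)
top_vals, top_idx = acts.topk(k)

# Stored per token: 64 int32 indices + 64 fp16 activations = 384 bytes,
# versus 4096 * 2 bytes = 8192 bytes for the raw fp16 embedding (~21x smaller).
record = {
    "indices": top_idx.to(torch.int32),
    "activations": top_vals.to(torch.float16),
}
print(record["indices"].shape, record["activations"].shape)
```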

enjalot commented 9 hours ago

This is an interesting technique for reducing the storage footprint of token-level embeddings: https://arxiv.org/html/2409.14683v1
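My rough reading of that paper is that you cluster each document's token embeddings and mean-pool every cluster, cutting the number of stored vectors by a chosen pool factor. A minimal sketch of that idea under those assumptions, not the paper's exact recipe:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_tokens(token_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Reduce (num_tokens, dim) -> (~num_tokens/pool_factor, dim) by clustering."""
    n_tokens = token_embeddings.shape[0]
    n_clusters = max(1, n_tokens // pool_factor)
    if n_tokens <= n_clusters:
        return token_embeddings
    Z = linkage(token_embeddings, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return np.stack([
        token_embeddings[labels == c].mean(axis=0) for c in np.unique(labels)
    ])

tokens = np.random.randn(500, 128).astype(np.float32)  # fake per-token embeddings
pooled = pool_tokens(tokens, pool_factor=2)
print(tokens.shape, "->", pooled.shape)  # (500, 128) -> (~250, 128)
```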