enjalot / latent-scope

A scientific instrument for investigating latent spaces

Support token level embeddings #64

Open enjalot opened 4 days ago

enjalot commented 4 days ago

Our current approach embeds datasets using Sentence Transformers, which give us one embedding per "chunk" of text (so whether we pass in 500 tokens or 100 tokens, we always get 1 embedding). Sentence Transformers "pool" the token embeddings into a single vector, usually by averaging them or by taking just the first or last token's embedding.
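For reference, a rough sketch of what that pooling step looks like when done by hand with `transformers` (the checkpoint name here is just an illustrative example, not what latent-scope necessarily ships with):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["a short chunk", "a much longer chunk of text with many more tokens in it"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean pooling: mask out padding, then average over the token axis so every
# chunk collapses to a single vector regardless of how many tokens it had.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)

print(token_embeddings.shape)  # e.g. torch.Size([2, 14, 384])
print(pooled.shape)            # torch.Size([2, 384])
```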

There is another technique gaining popularity, ColBERT, that instead gives you an embedding for each token. A recent model is jina-colbert-v2.

One could also imagine just getting back the hidden states from something like Llama-3.1-8B and working with those token-level embeddings.
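A hedged sketch of pulling token-level hidden states out of a causal LM with `transformers`. Llama-3.1-8B is gated and heavy, so a small model is used here as a stand-in (any checkpoint with the same architecture family would do); the shapes are the point, not the specific model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-135M"  # stand-in; swap for meta-llama/Llama-3.1-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# One embedding per token, per layer. For Llama-3.1-8B the hidden size is 4096,
# so a 500-token chunk yields a 500 x 4096 matrix instead of a single vector.
last_layer = out.hidden_states[-1]  # (1, num_tokens, hidden_dim)
print(len(out.hidden_states), last_layer.shape)
```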

When you don't pool you don't throw away a bunch of information, but of course this explodes the file size of the stored embeddings (a quick back-of-the-envelope below). It may still be worth it, and there are some things we could do to support token-level embeddings.
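All numbers here are illustrative assumptions (1M chunks, 500 tokens each, 4096-dim fp16 embeddings), just to show the scale of the blow-up:

```python
n_chunks = 1_000_000
tokens_per_chunk = 500
dim = 4096            # Llama-3.1-8B hidden size
bytes_per_value = 2   # fp16

pooled = n_chunks * dim * bytes_per_value
per_token = n_chunks * tokens_per_chunk * dim * bytes_per_value

print(f"pooled:    {pooled / 1e9:.1f} GB")     # ~8.2 GB
print(f"per-token: {per_token / 1e9:.1f} GB")  # ~4096 GB, i.e. ~500x larger
```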

One thing to try would be using RAGatouille to handle the nearest neighbor search.
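A minimal sketch of what that could look like, following the pattern in RAGatouille's docs. `colbert-ir/colbertv2.0` is the checkpoint their examples use; whether `jinaai/jina-colbert-v2` loads the same way is an assumption I haven't tested:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Index the chunks; RAGatouille builds a compressed multi-vector index on disk.
chunks = ["first chunk of text ...", "second chunk of text ..."]
RAG.index(collection=chunks, index_name="latent-scope-demo")

# Late-interaction search: each query token is matched against document tokens.
results = RAG.search(query="example query", k=5)
for r in results:
    print(r["rank"], r["score"], r["content"][:60])
```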

Another thing to try is to store the SAE top features of the tokens rather than their full embedding vectors. Theoretically, if an SAE is "good" it will reconstruct the embedding pretty well, so we could cut 4096-dimensional Llama embeddings down to e.g. 128 stored values per token (64 indices and 64 activations for a top-64 SAE).
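A sketch of that storage format. The SAE weights below are random placeholders standing in for a trained top-64 SAE, and the dictionary size is an arbitrary assumption; only the encode-then-keep-top-k step and the resulting record size are the point:

```python
import torch

d_model, d_sae, k = 4096, 65536, 64
W_enc = torch.randn(d_model, d_sae) / d_model**0.5  # placeholder for trained weights
b_enc = torch.zeros(d_sae)

token_embedding = torch.randn(d_model)  # one token's hidden state from the LM

# Encode and keep only the k largest activations; a top-k SAE is trained to
# reconstruct the input from exactly these k features.
acts = torch.relu(token_embedding @ W_enc + b_enc)
top_vals, top_idx = acts.topk(k)

# Stored per token: 64 int32 indices + 64 fp16 activations = 384 bytes,
# versus 4096 * 2 bytes = 8192 bytes for the raw fp16 embedding (~21x smaller).
record = {
    "indices": top_idx.to(torch.int32),
    "activations": top_vals.to(torch.float16),
}
print(record["indices"].shape, record["activations"].shape)
```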

enjalot commented 9 hours ago

This is an interesting technique for reducing the storage footprint of token-level embeddings: https://arxiv.org/html/2409.14683v1
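My rough reading of that paper is that you cluster each document's token embeddings and mean-pool every cluster, cutting the number of stored vectors by a chosen pool factor. A minimal sketch of that idea under those assumptions, not the paper's exact recipe:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_tokens(token_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Reduce (num_tokens, dim) -> (~num_tokens/pool_factor, dim) by clustering."""
    n_tokens = token_embeddings.shape[0]
    n_clusters = max(1, n_tokens // pool_factor)
    if n_tokens <= n_clusters:
        return token_embeddings
    Z = linkage(token_embeddings, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return np.stack([
        token_embeddings[labels == c].mean(axis=0) for c in np.unique(labels)
    ])

tokens = np.random.randn(500, 128).astype(np.float32)  # fake per-token embeddings
pooled = pool_tokens(tokens, pool_factor=2)
print(tokens.shape, "->", pooled.shape)  # (500, 128) -> (~250, 128)
```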