pannous opened 8 months ago

Hi, great project!
How hard would it be to extract embeddings from the LLMs?
```swift
func encode(_ text: borrowing String) -> [Token]
```

should do it.

```swift
let text = "hello world"
let llm = LLM(...)
let embeddings = llm.encode(text)
```

and for decoding:

```swift
let decodedText = llm.model.decode(embeddings)
```
Thanks, I was thinking about vector embeddings though: `llm.embedding("King")` ≈ `llm.embedding("Queen")`, i.e. `[Float]` vectors of dimension ~768.
`[Token]` would just be ≈ one int per word.
i think you have the wrong definition of LLM embeddings, and it's understandable because i was also once confused about the concept. you might want to check this comment. it's also the reason why i chose not to use the word "embedding" in this library.

if you want to test similarity between embeddings, you can cast the `[Token]` that the `encode` function outputs to `[Float]`, then use it in a vector DB or check the cosine similarity between your choices.
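for illustration, a minimal sketch of that idea; `cosineSimilarity` below is a hypothetical helper, not part of this library:

```swift
// sketch only: token IDs are arbitrary vocabulary indices, so this
// "similarity" carries no semantic meaning (see the caveat below)
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map(*).reduce(0, +)
    let magnitudeA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magnitudeB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (magnitudeA * magnitudeB)
}

let king = llm.encode("king").map(Float.init)   // [Token] -> [Float]
let queen = llm.encode("queen").map(Float.init)
if king.count == queen.count {                  // only comparable at equal length
    print(cosineSimilarity(king, queen))
}
```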
however, for checking similarity between simple words like "king" or "queen" in your example, i suggest you just use apple's Natural Language framework instead: since LLM tokens are chosen somewhat arbitrarily (as far as i know), there is no guarantee that "king" and "queen" come out more similar than "king" and "monitor".
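something along these lines should work (a sketch, assuming the English word-embedding asset is available on the device):

```swift
import NaturalLanguage

// apple's built-in word embeddings; wordEmbedding(for:) returns nil
// when the asset for that language isn't available
if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // cosine distance by default; smaller means more similar
    print(embedding.distance(between: "king", and: "queen"))
    print(embedding.distance(between: "king", and: "monitor"))

    // or pull the raw vector for use in a vector DB
    if let vector = embedding.vector(for: "king") {
        print(vector.count) // equals embedding.dimension
    }
}
```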
i haven't tested it myself, but for just checking similarities between sentences or words you could also use the similarity-search-kit library, or use it together with the Natural Language framework.
Thanks, very very helpful links!!!
LLM embeddings are usually a vector of floats; token encodings are a vector of ints. Casting these to float makes no sense, so no confusion here ;)
i was just saying that you have the option to cast an array of ints to an array of floats so that you can check cosine similarity. after all, an array of ints is also just a valid one-dimensional vector.
i'm glad i was able to help you!
so, i researched this a bit further, not being so sure whether i had understood the concept correctly. what you are referring to is indeed not part of the LLM itself. however, embedding models are usually used together with LLMs, typically for text search, and that's where the confusion occurs (aside from the fact that some people refer to tokens as embeddings, that is).

for example, mistral uses the `mistral-embed` model, and openAI uses `text-embedding-3-small`, `text-embedding-3-large`, and `text-embedding-ada-002`. they can be used in conjunction with an LLM like `mistral 7B`, but they have no direct relation to LLMs. since machine learning models are black boxes that we cannot really look into so far, there is no direct way to retrieve the "actual internal representation" of tokens using an LLM either, other than the tokens themselves.
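to make the separation concrete, here is a rough sketch of calling such a hosted embedding model (openAI's documented embeddings endpoint, with `text-embedding-3-small`); error handling is omitted, and none of this touches the local LLM at all:

```swift
import Foundation

// sketch: an embedding model served over HTTP, completely separate
// from whatever LLM you run locally
struct EmbeddingResponse: Decodable {
    struct Item: Decodable { let embedding: [Float] }
    let data: [Item]
}

func embed(_ text: String, apiKey: String) async throws -> [Float] {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/embeddings")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "text-embedding-3-small",
        "input": text,
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(EmbeddingResponse.self, from: data).data[0].embedding
}
```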
word2vec embeddings were not related to LLMs, but today embeddings are (mostly) computed via LLMs, or, as you correctly pointed out, SLMs (small language models), although some believe that truly large LLMs also give better embeddings.
> there is not a direct way to retrieve the "actual internal representation"
I think your research yielded a wrong result there. While using all current activations as an embedding would be overkill, LLM embeddings are indeed calculated from activations, for example (see the sketch after this list):
• Pooling Strategies: Applying operations such as mean or max pooling over activations from one or more layers to create fixed-size embeddings.
• Concatenation of Multiple Layers: Combining activations from multiple layers to form a richer representation.
• Last Layer Embeddings: Using the activations from the last hidden layer of the model as the embedding for a word or sentence. (which indeed makes no sense if the last layer's output is tokens)
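A minimal sketch of the mean-pooling strategy, assuming the model can expose its last layer's per-token hidden states as [[Float]] (that accessor is the missing piece, not something this library provides today):

```swift
// Mean pooling: average the hidden state at every token position into
// one fixed-size embedding for the whole input.
// `hiddenStates` holds one [Float] activation vector per token.
func meanPool(_ hiddenStates: [[Float]]) -> [Float] {
    guard let dimension = hiddenStates.first?.count else { return [] }
    var pooled = [Float](repeating: 0, count: dimension)
    for state in hiddenStates {
        for i in 0..<dimension { pooled[i] += state[i] }
    }
    return pooled.map { $0 / Float(hiddenStates.count) }
}
```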
thank you for the clarification and the clear explanation. i was the one who had the wrong idea, my bad. i'll look into this more, find a way to get embeddings through the methods you described, and keep you updated here. i really appreciate it. it's hard to get the right information in the LLM field as a non-researcher. i have to learn more about this.
i'll see if i can implement this in my library referencing this code: https://github.com/ggerganov/llama.cpp/blob/master/examples/embedding/embedding.cpp#L54
until then, in the llama.cpp library that this one depends on, it seems you will be able to get the LLM float embeddings you want by using `float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id)`.
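a rough sketch of what that might look like from swift, assuming the llama.cpp C module is imported, the context was created with embeddings enabled, and `context`/`model` are the underlying llama.cpp handles (none of this is exposed by the library yet):

```swift
// sketch only: pulls the pooled embedding for sequence 0 out of llama.cpp.
// llama_get_embeddings_seq returns a null pointer when the context
// wasn't set up to produce embeddings.
func sequenceEmbedding(context: OpaquePointer, model: OpaquePointer) -> [Float]? {
    let dimension = Int(llama_n_embd(model))
    guard let pointer = llama_get_embeddings_seq(context, 0) else { return nil }
    return Array(UnsafeBufferPointer(start: pointer, count: dimension))
}
```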