dleemiller / WordLlama

Things you can do with the token embeddings of an LLM
MIT License

Doubts about utility for multilingual models #30

Open TheMrguiller opened 1 month ago

TheMrguiller commented 1 month ago

Hi,

I really like your project as it provides an easy-to-use approach. I have been wondering: since the new Llama 3.1 is multilingual, could this approach also be used that way? Since we are only working with the token embeddings, without the contextualization part, the embeddings may not be as representative. Any thoughts on that? Could you explain your logic further and how your text embedding process works? Has it been successful with longer sentences, considering that it is a simple averaging?

Thank you so much @dleemiller @jimexist

dleemiller commented 1 month ago

There is no reason it can't work for multilingual text, but it's not only the initializing model that determines that. The current models I have were not trained on data covering multilingual tasks. The main limitation is that WordLlama is trained using the standard datasets that many popular embedding models are trained on:

4.1 Public Retrieval Datasets

We adopt the retrieval datasets as follows: MS MARCO (Bajaj et al., [2016](https://arxiv.org/html/2405.17428v1#bib.bib4)), HotpotQA (Yang et al., [2018](https://arxiv.org/html/2405.17428v1#bib.bib66)), Natural Question (Kwiatkowski et al., [2019](https://arxiv.org/html/2405.17428v1#bib.bib21)), PAQ (Lewis et al., [2021](https://arxiv.org/html/2405.17428v1#bib.bib26)), Stackexchange (Stack-Exchange-Community, [2023](https://arxiv.org/html/2405.17428v1#bib.bib53)), Natural language inference (Group et al., [2022](https://arxiv.org/html/2405.17428v1#bib.bib14)), SQuAD (Rajpurkar et al., [2016](https://arxiv.org/html/2405.17428v1#bib.bib48)), ArguAna (Wachsmuth et al., [2018](https://arxiv.org/html/2405.17428v1#bib.bib59)), BioASQ (Tsatsaronis et al., [2015](https://arxiv.org/html/2405.17428v1#bib.bib56)), FiQA (Maia et al., [2018](https://arxiv.org/html/2405.17428v1#bib.bib34)), FEVER (Thorne et al., [2018](https://arxiv.org/html/2405.17428v1#bib.bib55)). Typically, these datasets do not contain its own hardnegatives, necessitating the mining of such examples. To address this, we further finetune another encoder-based embedding model (Wang et al., [2022](https://arxiv.org/html/2405.17428v1#bib.bib62)) to select the hardnegatives on those datasets. Refer to Table [6](https://arxiv.org/html/2405.17428v1#A1.T6) for the number of samples used for training.
4.2 Public Non-retrieval Datasets

Besides retrieval datasets, we also utilize non-retrieval datasets from three sub-tasks in MTEB benchmark: classification, clustering and semantic similarity (STS). We pre-process these datasets to use the same format as retrieval datasets for contrastive training: instructed query q_inst^+ (containing query q^+), positive document d^+ and hard negative documents d_0^−, …, d_n^−.

We utilize the English training splits of various classification datasets from MTEB Huggingface datasets (Muennighoff et al., [2022](https://arxiv.org/html/2405.17428v1#bib.bib38); Lhoest et al., [2021](https://arxiv.org/html/2405.17428v1#bib.bib27)). The classification datasets that we use are: AmazonReviews-Classification (McAuley & Leskovec, [2013](https://arxiv.org/html/2405.17428v1#bib.bib35)), AmazonCounterfactual-Classification (O’Neill et al., [2021](https://arxiv.org/html/2405.17428v1#bib.bib43)), Banking77-Classification (Casanueva et al., [2020](https://arxiv.org/html/2405.17428v1#bib.bib7)), Emotion-Classification (Saravia et al., [2018](https://arxiv.org/html/2405.17428v1#bib.bib51)), IMDB-Classification (Maas et al., [2011](https://arxiv.org/html/2405.17428v1#bib.bib32)), MTOPIntent-Classification (Li et al., [2021](https://arxiv.org/html/2405.17428v1#bib.bib28)), ToxicConversations-Classification (Adams et al., [2019](https://arxiv.org/html/2405.17428v1#bib.bib1)), TweetSentimentExtraction-Classification (Maggie, [2020](https://arxiv.org/html/2405.17428v1#bib.bib33)).

^ e.g., the dataset description from NV-Embed

So far, we have primarily focused on training with the train splits of the most popular pair/triplet datasets from sentence-transformers, because those are the most approachable for getting started:

Sentence Transformers HF
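For illustration, loading one of those pair/triplet datasets with the `datasets` library looks roughly like this (the specific dataset name and config below are just an example, not necessarily ones used for training):

```python
# Illustrative only: load a pair/triplet dataset from the sentence-transformers
# collection on the Hugging Face Hub. The dataset name and config are examples.
from datasets import load_dataset

triplets = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
print(triplets[0])  # e.g. {'anchor': ..., 'positive': ..., 'negative': ...}
```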

I hope to release a multilingual model eventually (or even language-specific ones), but it will take some time to develop the appropriate data required to do it. We rely on contrastive data, since we are using MNRL loss (MultipleNegativesRankingLoss) within the sentence-transformers training framework. I am currently building a new dataset, which I'll put on Hugging Face, that will work in the sentence-transformers framework. Some time after that, I might look at what it will take to train a multilingual version (probably using some of their multilingual paraphrase data).
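For reference, a minimal sketch of MNRL training in the sentence-transformers framework looks something like the following; this is a generic example, not the actual WordLlama training script.

```python
# Generic sketch of contrastive training with MultipleNegativesRankingLoss
# (MNRL) in sentence-transformers -- not the actual WordLlama training code.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# MNRL expects (anchor, positive) pairs; the other positives in a batch
# serve as in-batch negatives.
train_examples = [
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
    InputExample(texts=["How do plants make food?", "Photosynthesis converts sunlight into energy."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```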

I compare WordLlama to word embedding models like GloVe, FastText, etc. It is a token embedding model trained to learn token weights and project the full set of token embeddings down to representations of a smaller dimension, keeping the same tokenizer vocabulary -- and using simple average pooling.

These are compromises we make for simplicity. The benefit is you mostly just need numpy and tokenizers for inference. It's very fast on CPU, and it's good enough for a lot of 'simple' tasks.
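To make the idea concrete, here is a rough numpy sketch of static token embeddings with average pooling. It is not the actual WordLlama implementation; the embedding file and tokenizer names are placeholders.

```python
# Minimal sketch of average-pooled static token embeddings (not WordLlama's
# actual code). Assumes a projected embedding matrix of shape (vocab_size, dim)
# saved as a .npy file, and a tokenizer whose vocabulary matches it.
import numpy as np
from tokenizers import Tokenizer

token_embeddings = np.load("token_embeddings.npy")          # placeholder file
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

def embed(text: str) -> np.ndarray:
    """Embed a string by average-pooling its static token embeddings."""
    ids = tokenizer.encode(text).ids
    vectors = token_embeddings[ids]   # one vector per token
    return vectors.mean(axis=0)       # simple average pooling

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed("token embeddings"), embed("word vectors")))
```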

It is not a replacement for transformer embedding models. The performance is much closer to word embedding models than to SOTA multibillion-parameter models that require inference runtimes and a lot of compute.

My goal is not to compete in that space, but to create a useful utility library for working with strings and text in Python. To that end, most of my time right now is focused on developing fast, useful applications rather than model performance. Of course that's important too, but I don't think eking out another half percent on benchmarks among the "best of the worst embedding models" is as helpful to people as writing algorithms that make it useful.

Anyway, that's my philosophy for the project. I know it's a bit confusing, because it's kind of straddling multiple spaces. Hopefully this helps.

TheMrguiller commented 1 month ago

Thank you so much for the detailed explanation, it really helps to visualize your ideas. The multilingual approach I was considering is specifically for using the token embeddings of Llama 3.1 and Qwen 2.5 models. It's great that you're focusing more on resource efficiency rather than just opting for large models. I will keep an eye on the project as it develops. Thank you for your amazing work!

jimexist commented 1 month ago

fwiw there's a 3.2 release of sbert:

dleemiller commented 1 month ago

Cool! Interesting method too. In theory we could use those embeddings in wordllama. Might give it a try sometime.