AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.

Multi Language Indexing & Retrieving Support? #65

Open YupingL opened 8 months ago

YupingL commented 8 months ago

Great work! I noticed that indexing and retrieving Chinese or Japanese documents shows low accuracy. Are there any tricks to improve the performance without fine-tuning?

bclavie commented 8 months ago

Hey!

ColBERT is essentially just a family of models; the examples use the original ColBERTv2, which is English-only.

There's a lot of interest in multilingual models and I'm hoping to be able to make that happen eventually, and I know other people are also working on it. Supporting more languages with ColBERT models would be fantastic! I'll leave this issue open as Help Wanted!

As of right now, I'm aware of these models supporting non-English languages:

- JaColBERT (Japanese)

hiepxanh commented 8 months ago

The link you provided for JaColBERT is not working:

The model is trained on the japanese split of MMARCO, augmented with hard negatives. The data, including the hard negatives, is available on huggingface datasets.

What suggestions do you have if I want to train on a new language? Can you share your experience? I hope you can write down some steps so I can follow them. I'm happy to introduce ColBERT to a new language if I can successfully train it.

hiepxanh commented 8 months ago

@bclavie I forgot to mention: how long did it take, and how much did it cost to train? If you did it on a home device, what are your specs?

bclavie commented 8 months ago

The link you provided for JaColBERT is not working

Oh, good catch! Updated it; the proper link is here.

What suggestions do you have if I want to train on a new language? Can you share your experience? I hope you can write down some steps so I can follow them. I'm happy to introduce ColBERT to a new language if I can successfully train it.

There's some information on the training in the technical report. Otherwise, it should be pretty straightforward to train a ColBERT model using the utils in RAGatouille, since training JaColBERT is what led to writing the lib! The training utilities handle hard negative mining, etc. for you.
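To give you an idea, here's roughly what that looks like with the RAGatouille training utils. Everything below (model names, the language code, the toy pair) is a placeholder to adapt to your own language and data, not a recommendation:

    from ragatouille import RAGTrainer

    # Placeholder base model: any suitable encoder for your target language works.
    trainer = RAGTrainer(
        model_name="MyJapaneseColBERT",
        pretrained_model_name="bert-base-multilingual-cased",
        language_code="ja",  # used when picking the hard-negative mining model
    )

    # (query, relevant_passage) pairs in your target language
    pairs = [
        ("日本の首都はどこですか？", "日本の首都は東京です。"),
    ]
    corpus = [passage for _, passage in pairs]  # ideally, your full document collection

    trainer.prepare_training_data(
        raw_data=pairs,
        all_documents=corpus,
        data_out_path="./data/",
        mine_hard_negatives=True,  # hard negatives are mined for you
    )
    trainer.train(batch_size=32)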

I forgot to mention: how long did it take, and how much did it cost to train? If you did it on a home device, what are your specs?

I was lucky to get some GPU credits, so I trained it on 8 Nvidia L4 GPUs for around 10 hours. I'm pretty sure you could still get decent results with less data and weaker hardware!

adrienB134 commented 8 months ago

Hey!

I had a go at training ColBERT for Spanish a few weeks ago. Unfortunately, I still haven't had time to properly evaluate it, but if anyone wants to try it, it can be found here: AdrienB134/ColBERTv1.0-bert-based-spanish-mmarcoES

labdmitriy commented 7 months ago

Hello @bclavie,

Maybe this one is interesting: ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval
https://github.com/ant-louis/xm-retrievers
https://huggingface.co/antoinelouis/colbert-xm/

4entertainment commented 4 months ago

How can I fine-tune the colbert-xm model using the RAGatouille library? Please give me code examples.

SunLemuria commented 1 month ago

colbert-xm (https://huggingface.co/antoinelouis/colbert-xm/) can be loaded using RAGatouille, which has multi-language support:

    from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbert-xm")
    results = RAG.rerank(query=query, documents=documents, k=k)
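And a quick sketch of indexing and searching with the same model (the documents and index name below are made up for illustration):

    from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbert-xm")
    RAG.index(
        collection=["これはサンプル文書です。", "这是另一个示例文档。"],
        index_name="multilingual_demo",
    )
    results = RAG.search(query="サンプル文書", k=2)
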
franperic commented 1 month ago

Hi @bclavie, many thanks for your research and development effort - I appreciate it a lot!

I would like to train a German ColBERT model. At the moment I am a bit confused about the data format for training.

From my understanding, ColBERTv2 & JaColBERT use n-way triplets with scores from a reranker. But how are the scores fed to the model? Studying the repo, I was not able to find an answer.

I would have prepared the data in the following way:

raw_query = [
    (query, (positive_passage, positive_score), [(negative_passage1, negative_score1), (negative_passage2, negative_score2), ...])
]

It would be very helpful if you could clarify this.

franperic commented 1 month ago

Is it possible that RAGatouille does not support ColBERTv2-style training at the moment?

I found this code in the ColBERT class, where the nway parameter is hardcoded to 2:

    def train(self, data_dir, training_config: ColBERTConfig):
        training_config = ColBERTConfig.from_existing(self.config, training_config)
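        # nway forced to 2: one positive and one negative per query, i.e. pairwise triplets only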
        training_config.nway = 2
        with Run().context(self.run_config):

https://github.com/AnswerDotAI/RAGatouille/blob/main/ragatouille/models/colbert.py#L443

adrienB134 commented 1 month ago

You are correct, right now ColBERTv2-style training isn't supported in RAGatouille (I think it might be added very soon though).

I trained a German ColBERTv1 a while back, here.

If you are planning to train a ColBERTv2 on the MMARCO dataset, I can give you the code I used for the French & Spanish versions I made.

bclavie commented 1 month ago

@adrienB134 is correct -- v2 training isn't supported yet.

"very soon" might be an overstatement because I'm still dealing w/ health issues and juggling more projects than I really should at the moment, but I've set myself an arbitrary deadline of mid october AT THE LATEST to submit ragatouille as a Demo paper to ECIR25, so it should be here in september as I hate missing arbitrary targets 😄

franperic commented 1 month ago

@adrienB134 - thank you for your input! I will check your German ColBERTv1 model. I would appreciate it a lot if you could share the code for your French and Spanish models.

@bclavie - thank you for the update! I hope you get well soon!

adrienB134 commented 1 month ago

@franperic Sorry it took me a while to answer. Here is the code I used to train ColBERTv2 models. It reuses the rankings from the English MMARCO dataset for the other languages, since they are just translations of the English version.

If you want to use your own dataset, you might want to check out the PyLate package. I haven't tried it (yet), but it looks very promising for "easy" ColBERT training.

@bclavie September sounds "very soon" enough to me ;)

bclavie commented 1 month ago

I also highly recommend PyLate! I've been somewhat involved with the development, helping with feedback and sanity checks, and it's decently likely that future RAGatouille training will use it as the backend anyway.
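For anyone who wants a head start, here's roughly the shape of PyLate's contrastive training loop. Double-check it against the PyLate README; the dataset name, base model, and training arguments below are just example placeholders:

    from datasets import load_dataset
    from sentence_transformers import (
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from pylate import losses, models, utils

    # Any (query, positive, negative) triplet dataset works; this name is just an example.
    train_dataset = load_dataset("sentence-transformers/msmarco-bm25", "triplet", split="train")

    # Placeholder base model: swap in the encoder you want to start from.
    model = models.ColBERT(model_name_or_path="bert-base-uncased")
    train_loss = losses.Contrastive(model=model)

    args = SentenceTransformerTrainingArguments(
        output_dir="./my-colbert",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=train_loss,
        data_collator=utils.ColBERTCollator(model.tokenize),
    )
    trainer.train()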