castorini / ura-projects


Repro: Synthetic multilingual retrieval models on MIRACL #36

Open thakur-nandan opened 3 months ago

thakur-nandan commented 3 months ago

We have a new project on multilingual retrieval and reproduction, and we are looking for 2 URA students to work with us.

Feel free to reach out on Slack or email us at nandant@gmail.com, xzhangbx@gmail.com.

Synthetic data-based multilingual LLM retrieval models

25th March 2024
Supervised by: Nandan Thakur, Crystina (Xinyu) Zhang
Working Style: Weekly Sync-up Meetings (Slack for urgent/code debugging)

OVERVIEW

SWIM-X models have recently been shown to perform strongly in cross-lingual and multilingual retrieval settings (source). However, they are built on encoder-only models (mT5), which are restricted to a context length of roughly 512 tokens, and they require large amounts of synthetic training data for pre-training and fine-tuning for retrieval, which makes extending them across ~101 languages difficult.
The first line of work is to benchmark SWIM-X in Pyserini and establish reproducible baselines, as a warm-up to get familiar with the existing models and datasets.
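
As a warm-up reference, here is a minimal sketch (not the eventual 2CR) of dense retrieval on one MIRACL language via Pyserini's Python API. The prebuilt index and query encoder names follow Pyserini's existing mDPR baselines and are assumptions for illustration; a SWIM-X reproduction would plug in its own encoder and index.

```python
# Minimal sketch: dense retrieval on one MIRACL language with Pyserini.
# Index/encoder names follow Pyserini's prebuilt-index naming and are
# assumptions for illustration; a SWIM-X checkpoint would need its own
# query encoder and index before a 2CR entry is possible.
from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

lang = 'sw'  # Swahili, one of the 18 MIRACL languages
encoder = AutoQueryEncoder('castorini/mdpr-tied-pft-msmarco', pooling='cls')
searcher = FaissSearcher.from_prebuilt_index(
    f'miracl-v1.0-{lang}-mdpr-tied-pft-msmarco', encoder)

hits = searcher.search('Kilimanjaro iko wapi?', k=100)
for rank, hit in enumerate(hits, start=1):
    # TREC run format: qid Q0 docid rank score tag
    print(f'q1 Q0 {hit.docid} {rank} {hit.score:.4f} swimx-repro')
```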

RELATED WORK

mE5-mistral-7B (source) is a recently introduced multilingual decoder-based retrieval model built on Mistral-7B. However, its training dataset is unavailable, and the model relies on a large amount of high-quality synthetic training data generated with GPT-4. Our work will focus instead on efficient fine-tuning using a smaller subset of multilingual training data.
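
For orientation, decoder-only retrievers in this family typically embed text by pooling the hidden state of the last token. Below is a simplified sketch of that pattern with Hugging Face transformers; the model name is a placeholder assumption, and the actual mE5-mistral recipe (instruction prefixes, explicitly appended EOS token, GPU/precision handling) has more moving parts.

```python
# Sketch of last-token pooling for a decoder-based embedder (mE5-mistral-7B style).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'intfloat/e5-mistral-7b-instruct'  # placeholder decoder embedder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right'                 # so the indexing below is valid
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq, dim)
    last = batch['attention_mask'].sum(dim=1) - 1            # index of last real token
    emb = hidden[torch.arange(hidden.size(0)), last]         # last-token pooling
    return torch.nn.functional.normalize(emb.float(), dim=-1)

q, d = embed(['query: where is Mount Kilimanjaro?',
              'passage: Mount Kilimanjaro is in Tanzania.'])
print(float(q @ d))   # cosine similarity between query and passage embeddings
```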

Research Questions

  1. Baseline: Reproduce SWIM-X (source) and push for 2CR within Pyserini/Anserini.
  2. Compare the SWIM-X reproduction against other multilingual LLM-based retrievers such as mE5-mistral-7B or Cohere Command-R (see the evaluation sketch below).
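
For the comparison in question 2, a sketch of scoring two TREC-format run files against MIRACL qrels with pytrec_eval, using nDCG@10 (MIRACL's official metric); the file paths and run names are placeholders.

```python
# Sketch: compare two runs (e.g., SWIM-X repro vs. an LLM retriever) on MIRACL qrels.
import pytrec_eval

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels('miracl-v1.0-sw-dev-qrels.tsv')     # placeholder path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut.10'})
for name in ['swimx-repro.run', 'me5-mistral.run']:     # placeholder run files
    per_query = evaluator.evaluate(load_run(name))
    ndcg = sum(s['ndcg_cut_10'] for s in per_query.values()) / len(per_query)
    print(f'{name}: nDCG@10 = {ndcg:.4f}')
```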

Future Scope

Further, we would like to examine multilingual LLMs (in contrast to the mT5 backbone used in SWIM-X) trained with a small, few-shot, synthetic-only training dataset. Would we still require a large synthetic training dataset like the one behind SWIM-X, or would a handful of few-shot examples per language be enough for a multilingual LLM-based retriever? And how do we extend the model across the 101 languages in mC4? (A few-shot sampling sketch follows this section.)
  • Explore the best approach to fine-tune LLM-based retrieval models such as Gemma-2b or Mistral-7b-v0.2 on the SWIM-IR dataset.
  • Investigate the minimum number of synthetic training pairs required.
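
As a starting point for the few-shot question, here is a sketch of sampling a small per-language training subset from SWIM-IR with the datasets library. The dataset path and field names are assumptions and should be checked against the actual SWIM-IR release.

```python
# Sketch: build a K-example few-shot training subset per language from SWIM-IR.
import random
from datasets import load_dataset

K = 32                            # few-shot budget per language; itself a research question
languages = ['bn', 'sw', 'te']    # small illustrative subset

few_shot = {}
for lang in languages:
    # Assumed dataset path and per-language config; adjust to the real release.
    split = load_dataset('nthakur/swim-ir-monolingual', lang, split='train')
    idx = random.sample(range(len(split)), K)
    few_shot[lang] = [
        {'query': split[i]['query'], 'positive': split[i]['text']}  # assumed field names
        for i in idx
    ]
    print(f'{lang}: sampled {len(few_shot[lang])} (query, passage) pairs')
```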

Resources

MILESTONES

  1. Reproduce SWIM-X models in Pyserini (M1)
    • Reproduce the SWIM-X models, create 2CR checkpoints for MIRACL, and include them in Pyserini. Reproduce evaluation on XOR-RETRIEVE and XTREME-UP.
  2. Familiarize with LLM retrieval fine-tuning (M2)
    • Run experiments to reproduce the RankLLaMA example (GitHub) and use it as a template to extend Gemma-2b/Mistral-7b to multilingual retrieval datasets, whether synthetic (SWIM-IR), human-labeled (MIRACL), or translated (mMARCO). A minimal contrastive fine-tuning sketch follows this list.
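
To make M2 concrete, here is a minimal sketch (not the RankLLaMA/Tevatron recipe itself) of one contrastive training step for a decoder-based retriever with LoRA and in-batch negatives. The model name, LoRA targets, and hyperparameters are illustrative assumptions.

```python
# Sketch: one InfoNCE training step for a decoder retriever with LoRA adapters.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = 'google/gemma-2b'   # assumption; could also be Mistral-7B-v0.2
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right'
base = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32,
                                        target_modules=['q_proj', 'v_proj']))

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors='pt')
    hidden = model(**batch).last_hidden_state
    last = batch['attention_mask'].sum(dim=1) - 1         # last non-pad token
    emb = hidden[torch.arange(hidden.size(0)), last]      # last-token pooling
    return F.normalize(emb, dim=-1)

def info_nce_step(queries, passages, optimizer, temperature=0.05):
    q, p = encode(queries), encode(passages)              # (B, H) each
    logits = q @ p.T / temperature                        # in-batch negatives
    loss = F.cross_entropy(logits, torch.arange(len(queries)))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# loss = info_nce_step(batch_queries, batch_passages, optimizer)
```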

FUTURE MILESTONES

  1. Few-shot LLM Retrieval Fine-tuning (M3)
    • Depending on the results in M2, extend the models further by fine-tuning on only a few examples per language (an idea similar to SetFit, GitHub). Find the optimal number of training examples required in each language.
  2. Extending Multilingual LLMs to 101 Languages (M4)
    • If M3 works out successfully, we can generate synthetic datasets for 101 languages (overlapping with the languages in mC4) and fine-tune a multilingual LLM across all of them. A prompting sketch for synthetic query generation follows this list.
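
For M4, a sketch of SWIM-IR-style synthetic query generation for a new language: prompt an LLM with a passage plus a few in-language exemplars and ask it to write a query the passage answers. The generator model and exemplars below are placeholders, not the actual SWIM-IR generator.

```python
# Sketch: few-shot prompting an LLM to generate a synthetic query for a passage.
from transformers import pipeline

generator = pipeline('text-generation', model='google/gemma-2b-it')  # placeholder model

def make_prompt(passage, language, exemplars):
    # exemplars: list of (passage, query) pairs in the target language
    shots = '\n\n'.join(f'Passage: {p}\nQuery ({language}): {q}' for p, q in exemplars)
    return f'{shots}\n\nPassage: {passage}\nQuery ({language}):'

exemplars = [('Mlima Kilimanjaro upo kaskazini mashariki mwa Tanzania.',
              'Kilimanjaro iko wapi?')]
prompt = make_prompt('Serengeti ni mbuga ya wanyama iliyoko Tanzania.',
                     'Swahili', exemplars)
out = generator(prompt, max_new_tokens=32, do_sample=False)
print(out[0]['generated_text'][len(prompt):].strip())   # generated synthetic query
```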

RELEVANT READING MATERIAL

thakur-nandan commented 2 months ago

@Richard5678 has started to work on this issue.

First, he will focus on reproducing the 2CR on MIRACL.