embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Scalable dataset generation with Wikipedia (for mid/low resource languages) #378

Open rasdani opened 2 months ago

rasdani commented 2 months ago

I am currently experimenting with a scalable approach to creating retrieval and reranking benchmark datasets based on the wikimedia/wikipedia HF datasets. I specifically want to target mid- and/or low-resource languages.

The idea is to generate questions with GPT-3.5-turbo and/or GPT-4-turbo, grounded in chunked Wikipedia articles, and use the resulting dataset as a benchmark for the asymmetric retrieval/reranking common in RAG applications.

Rationale: Wikipedia is often the highest-quality/most reliable corpus for languages with a small footprint on the internet. Wikipedia is available in many languages, so this method should apply broadly. Since generations will be single-sentence questions grounded in sufficient context, hallucinations should be unlikely. Short questions generated by strong LLMs, grounded in actual Wikipedia articles of the target language, should be a strict improvement over all the machine-translated versions of SQuAD out there.

Risks: GPT generation might not perform well enough in some target languages.

For this I am chunking the Wikipedia articles to 512 tokens of the intfloat/multilingual-e5 tokenizer (a very common multilingual embedding model; 512 tokens also coincides with the context limit in GermanQuAD).
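For reference, loading one language dump from the HF hub could look roughly like this (a minimal sketch; the exact dump-date config name is an assumption):

```python
from datasets import load_dataset

# Load a single-language dump from wikimedia/wikipedia.
# "20231101.de" (dump date + language code) is an assumed config name;
# check the dataset card for the configs that actually exist.
wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train")

# Each row holds an id, url, title and the full plain-text article.
print(wiki[0]["title"])
print(wiki[0]["text"][:300])
```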

Currently I am validating my idea with a German dataset. Next will be English and Bengali. For Bengali I can rely on the judgment of a native speaker.

Happy to hear thoughts/suggestions! :)

KennethEnevoldsen commented 2 months ago

@rasdani I believe this could be a very promising dataset. One point when creating the dataset is that it should be comparable to existing datasets out there. I believe reasonable baseline datasets to target are:

SQuAD, GermanQuAD (both high-resource), and norquad (low- to mid-resource). If we have a truly low-resource baseline, that would be nice as well.

Another risk, which I believe is relevant, is that the questions are not natural, i.e. not something people would ever actually ask (but something you might generate from a context). That being said, if we can match human-constructed baseline datasets (i.e. model rankings correlate), I believe we have a strong case.

> chunking them to 512 tokens of the intfloat/multilingual-e5 tokenizer (very common multilingual embeddings and coincides with context limit in GermanQuAD)

Seems like you could get more meaningful boundaries (paragraphs instead of mid-sentence splits) using chunking methods such as the recursive splitter in LangChain.
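A minimal sketch of that kind of chunking, sized with the multilingual-e5 tokenizer (the import path and the model size suffix are assumptions and may differ across versions):

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk length in multilingual-e5 tokens so chunks stay within the
# 512-token context limit of the embedding model.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")

splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=512,
    chunk_overlap=0,
    # Prefer paragraph, then sentence boundaries before falling back to spaces.
    separators=["\n\n", "\n", ". ", " "],
)

article_text = "..."  # full plain text of one Wikipedia article
chunks = splitter.split_text(article_text)
```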

Potentially related to #376

Related work: This tutorial, I believe, implements essentially what you are suggesting (with some filtering, e.g. for tables and other regions that are poor for question generation). Though the repo is hosted by me, it was actually developed by @jalkestrup, who is offline for the coming days but will be back online later this week.

I have a vague suspicion that there might also be papers examining this approach (@Muennighoff, @imenelydiaker?)

Muennighoff commented 2 months ago

Yeah IIRC InPars & Promptagator also synthetically generate embedding data with LLMs

KennethEnevoldsen commented 2 months ago

Thanks @Muennighoff

Seems like they generally do it with the intent of training models rather than generating test data (which is also an easier sell).

References: Promptagator, InPars, IIRC.

@rasdani an alternative direction you could go, though it is probably a paper in itself (so consider it more like brainstorming):

generate questions from the Wikidata graph (gold questions with known answers), choose a target LLM (a small model like Phi), and then the best embedding model is the one that best finds Wikipedia articles that allow the LLM to answer the questions.

This tests the embedding models in a realistic use case. The bias is now, of course, in the embedding model/LLM interaction instead of in the LLM question generation. Both approaches might generalize poorly to low-resource languages (even assuming the wiki content is good).
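A rough sketch of that evaluation loop (all the callables are placeholders to be filled in, not existing APIs):

```python
from typing import Callable, Iterable, List, Tuple

def evaluate_embedding_model(
    qa_pairs: Iterable[Tuple[str, str]],      # gold (question, answer) pairs derived from Wikidata
    retrieve: Callable[[str], List[str]],     # retrieval backed by the embedding model under test
    answer: Callable[[str, List[str]], str],  # fixed small LLM answering from the retrieved chunks
    is_correct: Callable[[str, str], bool],   # answer-matching heuristic, e.g. normalized exact match
) -> float:
    """Score an embedding model by how often the LLM answers correctly
    when it only sees the chunks that this model retrieved."""
    hits, total = 0, 0
    for question, gold in qa_pairs:
        context = retrieve(question)
        prediction = answer(question, context)
        hits += is_correct(prediction, gold)
        total += 1
    return hits / max(total, 1)
```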

Muennighoff commented 2 months ago

Sorry, with IIRC I meant "if I remember correctly", so only the other two (also, InPars has a v2: https://arxiv.org/abs/2301.01820). I need to improve my communication, I'm sorry 😅😅😅

KennethEnevoldsen commented 2 months ago

No worries - the other two were very relevant

rasdani commented 2 months ago

I did a test run and generated queries over the GermanRAG dataset with our internal synthetic data pipeline.

You can find the source dataset and the resulting dataset here: https://huggingface.co/datasets/rasdani/germanrag-positives https://huggingface.co/datasets/rasdani/germanrag-positives-queries

Generations are in the "query" column.

I used a subset of GermanRAG because I had already deduplicated the contexts from GermanQuAD there.

I then evaluated retrieval on both datasets and computed some correlations here.

How would you calculate correlation in retrieval performance @KennethEnevoldsen ?

The way I did it suggests that the current approach looks reasonable.
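One way to do the calculation: collect per-model scores on both datasets and correlate the rankings (a sketch, not the notebook's code; the score values below are illustrative placeholders):

```python
from scipy.stats import pearsonr, spearmanr

# MRR@10 per embedding model on the human gold queries and on the synthetic
# (GPT-generated) queries. Values below are illustrative placeholders only.
gold_scores = {"model_a": 0.71, "model_b": 0.65, "model_c": 0.58, "model_d": 0.52}
synth_scores = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.55, "model_d": 0.50}

models = sorted(gold_scores)
gold = [gold_scores[m] for m in models]
synth = [synth_scores[m] for m in models]

# Spearman compares the *ranking* of models across the two datasets, which is
# what matters for a benchmark; Pearson compares the raw scores.
rho, p = spearmanr(gold, synth)
r, _ = pearsonr(gold, synth)
print(f"Spearman rho={rho:.3f} (p={p:.3f}), Pearson r={r:.3f}")
```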

Here is the prompt template I used with GPT-4-turbo and temperature=0.5 in our internal data gen tool:

Your task is to anticipate possible search queries by users in the form of a question for a given document.
- The question should be formulated concretely and precisely and relate to the information from the given document.
- The question must be coherent and should make sense without knowing the document.
- The question must be answerable by the document.
- The question should focus on one aspect and avoid using subclauses connected with 'and'.
Limit the questions to the information from the given document and do not draw on your prior knowledge.

Orient your question to the following examples:
<document>
Guam === Nach dem Zweiten Weltkrieg === Seit 1946 steht das Territorium auf der UN-Liste der Hoheitsgebiete ohne Selbstregierung. 1949 unterschrieb Harry S. Truman den Organic Act, ein Gesetz, das Guam zu einem externen Territorium der USA mit innerer Autonomie machte, das es bis heute geblieben ist. Ab 1962 baute die United States Navy den Hafen Apra zu einem Marinestützpunkt für die Atom-U-Boote aus, die mit strategischen Mittelstreckenraketen vom Typ UGM-27 Polaris ausgerüstet sind (SSBN). Vom 15. September 1996 bis 16. Dezember 1996 führten die USA die verdeckte Operation Pacific Haven / Quick Transit Irak-Guam durch.
</document>

Search query:
Seit wann gehört Guam zu dem Gebiet der Vereinigten Staaten?

Generate a question for the following document:
<document>
{{ document }}
</document>

Search query:

Since the LLM already picked up on which language to use, I refrained from prompting for the target language specifically. But I might introduce it when using GPT-3.5-turbo.
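For illustration, filling the template and requesting a query could look roughly like this (a sketch against the openai Python client, not the internal tool mentioned above):

```python
from openai import OpenAI

# The template from above, kept as a plain string with a "{{ document }}" slot
# (only the tail is reproduced here for brevity).
prompt_template = (
    "Your task is to anticipate possible search queries by users ...\n\n"
    "Generate a question for the following document:\n"
    "<document>\n{{ document }}\n</document>\n\n"
    "Search query:\n"
)

chunk_text = "..."  # one 512-token Wikipedia chunk

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = prompt_template.replace("{{ document }}", chunk_text)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # or "gpt-3.5-turbo" for the cheaper run
    temperature=0.5,
    messages=[{"role": "user", "content": prompt}],
)
query = response.choices[0].message.content.strip()
print(query)
```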

KennethEnevoldsen commented 2 months ago

Thanks for the overview @rasdani, this looks very promising.

Some of my worries atm include:

1) this might only work for high-resource languages (so trying it out on Norwegian or similar seems valid)
2) these seem to be the performance scores of only one model; I would love to see the scores of maybe 10 models (selected across the spectrum) and how well the scores correlate between the two datasets (performance on the gold-standard dataset on the x-axis and on the constructed dataset on the y-axis)

x-tabdeveloping commented 2 months ago

We could also generate clustering datasets from Wikipedia in low-resource languages by traversing the category hierarchy. I've done this before and have some code lying around if you're interested.
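Not that code, but a minimal sketch of the idea against the public MediaWiki API (pages under each seed category become labelled examples; subcategories could be followed recursively via cmtype=subcat):

```python
import requests

def category_members(category: str, lang: str = "en", limit: int = 200) -> list:
    """List pages directly under a Wikipedia category via the MediaWiki API."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "categorymembers",
            "cmtitle": f"Category:{category}",
            "cmlimit": limit,
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["query"]["categorymembers"]

# Article titles under each top-level category become labelled clustering examples.
# Swap the language code (and category names) for the target language edition.
labelled = [
    {"title": page["title"], "label": cat}
    for cat in ["Physics", "History", "Biology"]
    for page in category_members(cat)
]
```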

KennethEnevoldsen commented 2 months ago

I believe #376 is what you are looking for ;)

rasdani commented 2 months ago

Hey @KennethEnevoldsen, here's my update on the correlation between the human gold dataset and my synthetic queries :)

I tested with 10 embedding models for which I had the config and setup ready: MRR@10 on germanrag-positives-queries, which contains the gpt-4-turbo generations produced with the above-mentioned prompt.

[Two images: plots correlating per-model MRR@10 on the gold GermanQuAD queries with the synthetic queries]

Here's the script and the notebook: https://github.com/rasdani/mmteb-wiki/blob/main/mmteb_wiki/correlation_study.py https://github.com/rasdani/mmteb-wiki/blob/main/mmteb_wiki/plots.ipynb

I ran the same generation pipeline with gpt-3.5-turbo and the queries look comparably good to gpt-4-turbo upon manual inspection. Will do the same correlation study for these, too.

I think I will tackle norquad next. I assume you are a native speaker of Norwegian? If so, this would be a nice validation (low/mid-resource example + your judgment).

Then I would try my data gen setup with Wikipedia directly instead of relying on SQuAD variants. And at some point I will do Bengali, with eyeballing by a native speaker.

EDIT: I noticed you're based in Denmark. Is norquad still OK for you to eyeball?

KennethEnevoldsen commented 2 months ago

Thanks for taking the time with this. Very happy to see the correlation; it looks like a close-to-perfect rank correlation, at least if you only consider significant rank differences (i.e. very similarly performing models should, when bootstrapped, perform similarly, so swapping two such models shouldn't be counted as an error).

To do those calculations, however, we would need to implement bootstrapping in the evaluations (shouldn't be too hard)
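A minimal sketch of such a bootstrap over the per-model scores (not the notebook's exact implementation; the example scores are illustrative placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman(gold, synth, n_boot=10_000, seed=0):
    """Bootstrap the Spearman rank correlation by resampling models with replacement."""
    rng = np.random.default_rng(seed)
    gold, synth = np.asarray(gold), np.asarray(synth)
    rhos = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(gold), size=len(gold))
        rho, _ = spearmanr(gold[idx], synth[idx])
        if not np.isnan(rho):  # skip degenerate resamples (e.g. all-identical ranks)
            rhos.append(rho)
    rhos = np.asarray(rhos)
    return rhos.mean(), np.percentile(rhos, [2.5, 97.5])

# gold / synth: MRR@10 of the same models on the gold and synthetic datasets.
mean_rho, ci = bootstrap_spearman([0.71, 0.65, 0.58, 0.52], [0.69, 0.66, 0.55, 0.50])
print(mean_rho, ci)
```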

> I think I will tackle norquad next. I assume you are a native speaker of Norwegian?

Sadly not, but I speak Danish, which is closely related, so feel free to forward any questions you have. I also don't mind taking a look at the data (most Danes can read Norwegian with some effort).

rasdani commented 2 months ago

Added bootstrapped rank correlation to the notebook.

Mean Spearman Rank Correlation: 0.9302937560901587
95% Confidence Interval: [0.69220779 1.        ]

Looks good, doesn't it? @KennethEnevoldsen