Closed — cylinbao closed this 1 month ago
We are preparing the preprocessed dataset (around 30+ GB, which will take a long time to upload to HF).
Currently, we have no plan to upload the index files, as they are too large and each retriever has its own index.
We will notify you as soon as the data files have been uploaded.
Best, Yutao
@cylinbao hi, the wiki dataset we used for the experiments has been uploaded (https://huggingface.co/datasets/ignore/FlashRAG_datasets/tree/main/retrieval-corpus).
@cylinbao I just ran some test experiments and found that, due to some settings modified during subsequent development, the llama3 model may not provide accurate answers (affecting exact match).
I have reverted these settings to the original ones (see 1e94f2633bceb74119e12ca9cea40896086b899f) and obtained normal test results.
Sorry for the mistake. I hope this helps.
So what's the major difference between now and then? I tried replug and only got a 0.13 EM score.
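For context, "EM" here is the exact-match metric. A minimal sketch of how it is commonly computed (SQuAD-style answer normalization; FlashRAG's exact normalization rules may differ):

```python
import re
import string


def normalize(text: str) -> str:
    # Lowercase, strip punctuation and the articles a/an/the,
    # and collapse whitespace (SQuAD-style normalization).
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: list[str]) -> float:
    # 1.0 if the normalized prediction equals any normalized gold answer.
    return float(any(normalize(prediction) == normalize(g) for g in golds))
```

Small changes in generation settings (e.g. the model failing to stop and emitting trailing text) can easily drive this strict metric toward zero, which is why the stop-token fix below matters so much for the reported numbers.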
@BUAADreamer There are two differences:
1. `<|eot_id|>` was added into `eos_token_id`;
2. a stop condition was added in generation to let llama3 stop normally (see eda0c6a916d23405e2666b24477fcf26819996e6).

Thanks for the reply. I understand this is an actively developed project, so modifications are normal. But I wonder: do you plan to release the accuracy numbers with the new changes?
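The two llama3 stopping fixes might look roughly like this (a sketch, not the actual FlashRAG code; the helper name `build_generation_kwargs` and the exact config keys are assumptions — see the commit above for the real change):

```python
def build_generation_kwargs(tokenizer):
    """Sketch: extend the stop criteria for Llama-3 chat models.

    Llama-3 ends each chat turn with <|eot_id|>, which differs from the
    tokenizer's default eos token, so generation may run on past the
    answer unless both are treated as stopping points.
    """
    eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
    return {
        # 1. add <|eot_id|> into eos_token_id (HF generate accepts a list)
        "eos_token_id": [tokenizer.eos_token_id, eot_id],
        # 2. backends that take literal stop strings (e.g. vLLM) use this
        "stop": ["<|eot_id|>"],
    }
```

Without these, the model keeps generating past its answer, and strict metrics like exact match collapse even when the answer itself is correct.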
@cylinbao Thanks for your suggestion! For subsequent changes that may affect the results, we will re-run the methods to evaluate how much the results are affected.
The results we currently report are actually based on the above improvements. These improvements were present during our experiments but were later removed for code-related reasons, so this is just a rollback.
I wonder whether the preprocessed wiki dump and index used in your experiments are available?
I tried testing FlashRAG with my own wiki dump from wiki_dpr and an index built with contriever. However, the accuracy I got on some of your baselines (zero-shot, naive, iter-gen) is lower than the reported numbers. I suspect it might be a misalignment between the data source and the index, so it would be really helpful if you could provide the preprocessed wiki dump and index. I know we can replicate them by following this, but it seems to take a long time to run.