RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research
https://arxiv.org/abs/2405.13576

Problem in replicating results in Wiki Data #42

Closed cyanmishra92 closed 1 week ago

cyanmishra92 commented 1 week ago

Hi. As a first step, we were able to run the example code. Now we are moving on to the wiki data and trying to run the wiki-based experiments. Towards this, we have downloaded the corpus and flat index from Hugging Face (https://huggingface.co/datasets/ignore/FlashRAG_datasets/tree/main/retrieval-corpus).
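For reference, a minimal sketch of how those two files could be fetched with `huggingface_hub` (the repo id comes from the link above, but the exact filenames and whether they are zipped on the hub are assumptions; adjust the paths to match the dataset layout):

```python
# Sketch: download the wiki18 corpus and prebuilt flat index from the Hugging Face dataset repo.
# The filenames below mirror the paths used later in this thread; verify them on the hub first.
from huggingface_hub import hf_hub_download

repo_id = "ignore/FlashRAG_datasets"  # from the link above
for filename in [
    "retrieval-corpus/wiki18_100w.jsonl",
    "retrieval-corpus/wiki18_100w_e5_flat.index",
]:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    print("Downloaded to:", local_path)
```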

While modifying the given simple_pipeline.py to work with the wiki data, we are running into some trouble. Can you please help us out?

For your reference, these are the only changes we made:

In the simple pipeline, we changed the corpus and index paths to point to the downloaded corpus and index:

```python
config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/wiki18_100w_e5_flat.index',
    'corpus_path': 'indexes/wiki18_100w.jsonl',
    'model2path': {'e5': args.retriever_path, 'llama3-8B-instruct': args.model_path},
    'generator_model': 'llama3-8B-instruct',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'sub_em'],
    'retrieval_topk': 1,
    'save_intermediate_data': True,
}
```
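For context, this config dict is consumed roughly as in the quick-start simple_pipeline.py; the sketch below approximates that flow, and the split name 'test' plus the omitted prompt template are assumptions:

```python
# Approximate quick-start flow that uses the config above (check simple_pipeline.py for the exact version).
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

config = Config(config_dict=config_dict)
all_split = get_dataset(config)        # loads the splits found under data_dir/dataset_name
test_data = all_split['test']

pipeline = SequentialPipeline(config)  # a custom prompt_template can also be passed here
output_dataset = pipeline.run(test_data, do_eval=True)
```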

We are facing a weird error:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Find reference in template
Find question in template
Traceback (most recent call last):
  File "/data/wikimedia/FlashRAG/FlashRAG/examples/quick_start/wiki_pipeline.py", line 35, in <module>
    pipeline = SequentialPipeline(config, prompt_template=prompt_templete)
  File "/data/wikimedia/FlashRAG/FlashRAG/flashrag/pipeline/pipeline.py", line 54, in __init__
    self.retriever = get_retriever(config)
  File "/data/wikimedia/FlashRAG/FlashRAG/flashrag/utils/utils.py", line 76, in get_retriever
    return getattr(
  File "/data/wikimedia/FlashRAG/FlashRAG/flashrag/retriever/retriever.py", line 230, in __init__
    self.index = faiss.read_index(self.index_path)
  File "/csmishra/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 12365, in read_index
    return _swigfaiss_avx2.read_index(args)
RuntimeError: Error in faiss::Index faiss::read_index(faiss::IOReader*, int) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244517602/work/faiss/impl/index_read.cpp:1053: Index type 0x61746164 ("data") not recognized
```

Can someone please help us out here?

--Regards Cyan

ignorejjj commented 1 week ago

Hello. Your code is correct. We are wondering whether something went wrong when the index was uploaded to Hugging Face; we will run some tests later.

You can also try building your own index first. You mentioned in your email that you encountered some problems there; can you provide the corresponding code and error?

cyanmishra92 commented 1 week ago

Hi. I have tried building my own index. I used the given script:

```bash
python -m flashrag.retriever.index_builder \
    --retrieval_method e5 \
    --model_path /FlashRAG/FlashRAG/flashrag/retriever/model/e5-base-v2 \
    --corpus_path /FlashRAG/FlashRAG/examples/quick_start/indexes/wiki18_100w.jsonl \
    --save_dir /FlashRAG/FlashRAG/examples/quick_start/indexes/ \
    --use_fp16 \
    --max_length 256 \
    --batch_size 512 \
    --pooling_method mean \
    --faiss_type Flat \
    --save_embedding
```

I think an embedding file was created, named emb_e5.memmap (61 GB). I am not sure why it did not generate a .index file. Can you clarify this?

Secondly, I also ran the same code on the wiki data, using emb_e5.memmap as the index:

```python
config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/emb_e5.memmap',
    'corpus_path': 'indexes/wiki18_100w.jsonl',
    'model2path': {'e5': args.retriever_path, 'llama3-8B-instruct': args.model_path},
    'generator_model': 'llama3-8B-instruct',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'sub_em'],
    'retrieval_topk': 1,
    'save_intermediate_data': True,
}
```

We got some errors:

```
$ python wiki_pipeline.py --model_path /FlashRAG/FlashRAG/flashrag/generator/model/Meta-Llama-3-8B-Instruct/ --retriever_path /FlashRAG/FlashRAG/flashrag/retriever/model/e5-base-v2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Find reference in template
Find question in template
Traceback (most recent call last):
  File "/FlashRAG/FlashRAG/examples/quick_start/wiki_pipeline.py", line 35, in <module>
    pipeline = SequentialPipeline(config, prompt_template=prompt_templete)
  File "/FlashRAG/FlashRAG/flashrag/pipeline/pipeline.py", line 54, in __init__
    self.retriever = get_retriever(config)
  File "/FlashRAG/FlashRAG/flashrag/utils/utils.py", line 76, in get_retriever
    return getattr(
  File "/FlashRAG/FlashRAG/flashrag/retriever/retriever.py", line 230, in __init__
    self.index = faiss.read_index(self.index_path)
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 12365, in read_index
    return _swigfaiss_avx2.read_index(args)
RuntimeError: Error in faiss::Index faiss::read_index(faiss::IOReader*, int) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244517602/work/faiss/impl/index_read.cpp:1053: Index type 0x3c69e000 ("\x00\xe0i<") not recognized
```

I believe this is because the right index file was not created; faiss.read_index cannot parse the raw emb_e5.memmap embedding file as an index. Can you let me know if I am making any mistake here?

Thanks a lot

Cyan

ignorejjj commented 1 week ago

After the embedding file is saved, the message "Creating index" is printed to the console, and after waiting for a period of time, the index will be built. Could you have manually interrupted this process?

You can build an index based on the embedding file, so you don't have to repeat the previous steps:

```bash
python -m flashrag.retriever.index_builder \
    --retrieval_method e5 \
    --model_path /FlashRAG/FlashRAG/flashrag/retriever/model/e5-base-v2 \
    --corpus_path /FlashRAG/FlashRAG/examples/quick_start/indexes/wiki18_100w.jsonl \
    --save_dir /FlashRAG/FlashRAG/examples/quick_start/indexes/ \
    --use_fp16 \
    --max_length 256 \
    --batch_size 512 \
    --pooling_method mean \
    --faiss_type Flat \
    --embedding_path indexes/emb_e5.memmap
```
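For reference, building a Flat index from a saved embedding file is conceptually just reading the embeddings back and adding them to a faiss index. A rough sketch is below; the 768 dimension, the float32 dtype, and the inner-product metric are assumptions (check the index_builder source for how the memmap is actually written):

```python
# Rough sketch: build a Flat faiss index directly from a saved embedding memmap.
# Assumed: e5-base-v2 embeddings of dimension 768, stored as float32.
import faiss
import numpy as np

dim = 768
emb = np.memmap("indexes/emb_e5.memmap", dtype=np.float32, mode="r").reshape(-1, dim)

index = faiss.IndexFlatIP(dim)  # inner-product Flat index (metric is an assumption)
chunk = 1_000_000               # add in chunks so the 60+ GB memmap is never fully in RAM
for start in range(0, emb.shape[0], chunk):
    index.add(np.ascontiguousarray(emb[start:start + chunk], dtype=np.float32))

faiss.write_index(index, "indexes/e5_Flat.index")
```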
cyanmishra92 commented 1 week ago

Hi. Thanks again for the quick response; I really appreciate it. I did not interrupt the process myself, but it might have been interrupted in the background, so you might be absolutely right. Let me try the index rebuilding process.

Just to let you know, I was using a multi-GPU setup, and I don't think that would cause any issue. However, I was using screen to detach the process from my terminal so that it does not terminate accidentally, and that might have led to some issues.

Let me try all the steps again, both (1) the method you suggested and (2) rebuilding from scratch, and get back to you.

--Cyan

cyanmishra92 commented 1 week ago

Hi. I was able to build the index. I am uploading it to the drive and will give you a copy as well.

Now I am trying to run inference with the same basic pipeline example code:

```python
config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/e5_Flat.index',
    'corpus_path': 'indexes/wiki18_100w.jsonl',
    'model2path': {'e5': args.retriever_path, 'llama3-8B-instruct': args.model_path},
    'generator_model': 'llama3-8B-instruct',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'sub_em'],
    'retrieval_topk': 1,
    'save_intermediate_data': True,
}
```

And I am getting a CUDA out-of-memory error. I have tried everything from a single GPU (A100) to 4 GPUs (4x A100). The error is:

```
python wiki_pipeline.py --model_path /FlashRAG/FlashRAG/flashrag/generator/model/Meta-Llama-3-8B-Instruct/ --retriever_path /FlashRAG/FlashRAG/flashrag/retriever/model/e5-base-v2
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Find reference in template
Find question in template
Loading dataset shards: 100%|██████████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 581.68it/s]
Traceback (most recent call last):
  File "/FlashRAG/FlashRAG/examples/quick_start/wiki_pipeline.py", line 35, in <module>
    pipeline = SequentialPipeline(config, prompt_template=prompt_templete)
  File "/FlashRAG/FlashRAG/flashrag/pipeline/pipeline.py", line 54, in __init__
    self.retriever = get_retriever(config)
  File "/FlashRAG/FlashRAG/flashrag/utils/utils.py", line 76, in get_retriever
    return getattr(
  File "/FlashRAG/FlashRAG/flashrag/retriever/retriever.py", line 247, in __init__
    self.encoder = Encoder(
  File "/FlashRAG/FlashRAG/flashrag/retriever/encoder.py", line 49, in __init__
    self.model, self.tokenizer = load_model(model_path=model_path,
  File "/FlashRAG/FlashRAG/flashrag/retriever/utils.py", line 13, in load_model
    model.cuda()
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2694, in cuda
    return super().cuda(*args, **kwargs)
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 915, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/softwares/anaconda/envs/flashRAG/lib/python3.9/site-packages/torch/nn/modules/module.py", line 915, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
```

Can you let me know if you have seen this before?

Thanks again for the quick help.

--Cyan

ignorejjj commented 1 week ago

This looks very strange; if the GPU is not occupied by other programs, it is more than enough to run this. After running (even if an error occurs midway), a folder will be created containing a yaml file that records the parameters. Can you show me the contents of that yaml file?

cyanmishra92 commented 1 week ago

Yeah, it appears strange to me too. I took the wiki jsonl file and created an index from the first 100k lines; that works perfectly on my 4090 GPU. However, with the full index running on an A100, I am getting this error.

Thanks for the pointer to the yaml file. Here is the relevant content; I am attaching the full file as well.

```yaml
corpus_path: indexes/wiki18_100w.jsonl
data_dir: dataset/
dataset_name: nq
dataset_path: dataset/nq
device: !!python/object/apply:torch.device
```

configyaml.zip

ignorejjj commented 1 week ago

The retriever init logic loads the faiss index, the corpus, and the model separately. From the error message, it appears the error occurred while loading the model. But strangely, you haven't enabled the faiss GPU option, so the first two steps should not take up any GPU memory.

Can you try monitoring the GPU's memory usage during this process? I suspect it's possible that faiss is using the GPU.

cyanmishra92 commented 1 week ago

It is possible that Faiss uses GPU. I have the Faiss GPU version installed. I will monitor the GPU memory usage during the process and update you.

Do you think I should move to Faiss CPU for this as a first step?

ignorejjj commented 1 week ago

Actually, that's not necessary. We provide an option in the config settings to choose between faiss CPU and GPU, and the default is CPU.

But there may also be a bug somewhere; monitoring the memory should help determine that.
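If you want to pin the choice explicitly, it can go into the config dict; the key name below is based on basic_config.yaml, so double-check it against your version:

```python
config_dict = {
    # ... the rest of your settings as before ...
    'faiss_gpu': False,  # assumed key name: keep the faiss index on CPU (set True to move it to GPU)
}
```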

cyanmishra92 commented 1 week ago

I tried monitoring the GPU memory, but the memory utilization stayed at 0 throughout the execution (until the error appeared). I used pynvml to monitor the GPU usage. So, as you suspected earlier, it is not using any GPU memory.
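For completeness, this is roughly the kind of polling loop that can be used; a minimal sketch with pynvml, with an arbitrary one-second interval:

```python
# Minimal GPU-memory polling loop using pynvml (pip install pynvml).
# Run it in a separate terminal while the pipeline executes.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used_mib = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 1024**2 for h in handles]
        print("GPU memory used (MiB):", [round(u) for u in used_mib])
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```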

ignorejjj commented 1 week ago

If the GPU's memory usage is always 0 before loading the model, then the retriever model alone should not immediately run out of memory. Please make sure that the GPUs you are using do not already have memory in use from other processes.

cyanmishra92 commented 1 week ago

Let me try a fresh run on a different machine in my morning. If I cannot get it fixed, I'll get back to you. For now, I believe it is a problem with the hardware environment. Thanks for all the help. I'll close this issue now and open a new one if needed.

I must say, awesome help and very quick responses. 👍 💯

cyanmishra92 commented 1 week ago

Hi,

I figured out the issue; it is a simple one in the config file.

The _init_device() function in flashrag/config/config.py is overriding the CUDA_VISIBLE_DEVICES environment variable set in the system:

```python
def _init_device(self):
    gpu_id = self.final_config['gpu_id']
    if gpu_id is not None:
        # Overwrites whatever CUDA_VISIBLE_DEVICES was already set in the environment.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        import torch
        self.final_config['device'] = torch.device('cuda')
    else:
        import torch
        self.final_config['device'] = torch.device('cpu')
```

I believe this can be fixed by passing the right gpu_id, or by modifying line 31 of flashrag/config/basic_config.yaml:

```yaml
gpu_id: "0,1,2,3"
```

Can you confirm this? If so, can you let me know whether we can pass gpu_id as a command-line parameter somehow, or change the yaml file at runtime?

-- Regards,
Cyan

ignorejjj commented 1 week ago

You can just add gpu_id to your config_dict and set the correct value; it will override the default.
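A minimal sketch of what that looks like, reusing the config from earlier in this thread:

```python
config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/e5_Flat.index',
    'corpus_path': 'indexes/wiki18_100w.jsonl',
    'model2path': {'e5': args.retriever_path, 'llama3-8B-instruct': args.model_path},
    'generator_model': 'llama3-8B-instruct',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'sub_em'],
    'retrieval_topk': 1,
    'save_intermediate_data': True,
    'gpu_id': '0,1,2,3',  # overrides the default gpu_id from basic_config.yaml
}
```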

cyanmishra92 commented 1 week ago

Yes, I just added that, and it works now. 💯 Thanks for all the help and support!

--Regards Cyan