RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research
https://arxiv.org/abs/2405.13576
MIT License

Error When Reproducing the Baseline Using BM25 as the Retrieval Method #102

Closed · YANthinkn closed this 1 week ago

YANthinkn commented 1 week ago

Hi,

I encountered a strange problem when reproducing the baseline using BM25 as the retrieval method.

First, I used wiki-18.jsonl, downloaded from your Hugging Face dataset, as the full document collection.

Then I built retrieval indexes with both the BM25 and E5-base-v2 methods. The main problem occurred with the BM25 one. The command to generate the index using BM25 with bm25s as the backend is listed below:

python -m flashrag.retriever.index_builder \
    --retrieval_method bm25 \
    --corpus_path dataset/wiki-18.jsonl \
    --bm25_backend bm25s \
    --save_dir bm25s_indexes/ 

The files generated after indexing in ~/FlashRAG/examples/quick_start/bm25s_indexes/bm25 are listed below:

corpus.jsonl         data.csc.index.npy     indptr.csc.index.npy  vocab.index.json
corpus.mmindex.json  indices.csc.index.npy  params.index.json
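
As a side note, the saved index can be sanity-checked by loading and querying it with bm25s directly. A minimal sketch, assuming the standard bm25s API (note that retrieve returns numpy arrays for both documents and scores, the same type that trips up json.dump later):

import bm25s

# Load the index produced by index_builder (corpus.jsonl sits alongside it).
retriever = bm25s.BM25.load("bm25s_indexes/bm25", load_corpus=True)

# Tokenize a query and fetch the top-3 documents with their BM25 scores.
query_tokens = bm25s.tokenize("who got the first nobel prize in physics")
docs, scores = retriever.retrieve(query_tokens, k=3)
print(docs, scores)  # both are numpy arrays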

After indexing, I ran a benchmark to test the result, with Llama3-8B-instruct as the generator model.

import argparse
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline
from flashrag.prompt import PromptTemplate

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str)
parser.add_argument("--retriever_path", type=str)
args = parser.parse_args()

config_dict = {
    "data_dir": "dataset/FlashRAG_datasets/",
    "dataset_name": "nq",
    "split": ["test"],
    "index_path": "bm25s_indexes/bm25/",
    "corpus_path": "dataset/wiki-18.jsonl",
    "model2path": {"e5": "models/e5-base-v2", "llama3-8B-instruct": "models/llama-3-8b"},
    "generator_model": "llama3-8B-instruct",
    "retrieval_method": "bm25",
    "metrics": ['em','f1','acc','precision','recall','input_tokens'] ,
    "retrieval_topk": 1,
    "save_intermediate_data": True,
}

config = Config(config_dict=config_dict)

all_split = get_dataset(config)
test_data = all_split["test"]
prompt_template = PromptTemplate(
    config,
    system_prompt="Answer the question based on the given document. \
                    Only give me the answer and do not output any other words. \
                    \nThe following are given documents.\n\n{reference}",
    user_prompt="Question: {question}\nAnswer:",
)

pipeline = SequentialPipeline(config, prompt_template=prompt_template)

output_dataset = pipeline.run(test_data, do_eval=True)
print("---generation output by bm25---")
print(output_dataset.pred)
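
I saved this script as bm25s_pipline.py and invoked it roughly as follows (the two CLI arguments are parsed, but the paths actually used come from config_dict above):

python bm25s_pipline.py \
    --model_path models/llama-3-8b \
    --retriever_path bm25s_indexes/bm25/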

And the error log is:

Loading checkpoint shards:   0%|                                              | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|█████████▌                            | 1/4 [01:06<03:19, 66.41s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████| 4/4 [03:35<00:00, 53.87s/it]
Generation process:   0%|                                                   | 0/903 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
Generation process:   0%|                                        | 1/903 [00:56<14:14:49, 56.86s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
.......
.......
.......

[19:18<00:01,  1.55s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generation process: 100%|█████████████████████████████████████████| 903/903 [19:28<00:00,  1.29s/it]
Traceback (most recent call last):
  File "/scratch/yanxinyu/FlashRAG/examples/quick_start/bm25s_pipline.py", line 42, in <module>
    output_dataset = pipeline.run(test_data, do_eval=True)
  File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 130, in run
    dataset = self.evaluate(dataset, do_eval=do_eval, pred_process_fun=pred_process_fun)
  File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 37, in evaluate
    eval_result = self.evaluator.evaluate(dataset)
  File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 66, in evaluate
    self.save_data(data)
  File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 82, in save_data
    data.save(save_path)
  File "/scratch/FlashRAG/flashrag/dataset/dataset.py", line 160, in save
    json.dump(save_data, f, indent=4)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 429, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ndarray is not JSON serializable

And the result in intermediate_data.json is:

~/FlashRAG/examples/quick_start/output/nq_2024_11_18_23_22_experiment$ cat intermediate_data.json 
[
    {
        "id": "test_0",
        "question": "who got the first nobel prize in physics",
        "golden_answers": [
            "Wilhelm Conrad R\u00f6ntgen"
        ],
        "output": {

I might be overlooking something fundamental, but I've been struggling with this issue. From the traceback, json.dump seems to run into a numpy ndarray somewhere inside the saved output (perhaps the retrieval scores), but I am not sure where it comes from. Any assistance would be helpful. Thanks!

ignorejjj commented 1 week ago

Thanks for pointing out the issue. It is indeed a bug that caused problems when saving the JSON output. We have fixed it in the latest version.
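
For anyone still on an older version, the usual workaround for this kind of error is to convert numpy objects into plain Python types during serialization. A minimal sketch of such a conversion (a generic workaround, not our actual patch):

import json
import numpy as np

def to_serializable(obj):
    # Fallback for json.dump: turn numpy arrays/scalars into plain Python types.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):  # covers np.float32, np.int64, etc.
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# e.g. when saving:
# json.dump(save_data, f, indent=4, default=to_serializable)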

YANthinkn commented 1 week ago

Thank you for getting back to me. I will try the newest version asap :-).