I encountered a strange problem when reproducing the baseline with BM25 as the retrieval method.
First, I used wiki-18.jsonl, downloaded from your Hugging Face dataset, as the full document collection.
Then, when building the retrieval index, I used the BM25 and E5-base-v2 methods. The main problem occurred with the BM25 one. The code to generate an index using the bm25 method and bm25s as the backend is listed below.
After indexing, I ran the benchmark script below to test the result, with Llama3-8B-instruct as the generator model.
import argparse

from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline
from flashrag.prompt import PromptTemplate

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str)
parser.add_argument("--retriever_path", type=str)
args = parser.parse_args()

# Benchmark configuration: NQ test split, BM25 retrieval, top-1 document.
config_dict = {
    "data_dir": "dataset/FlashRAG_datasets/",
    "dataset_name": "nq",
    "split": ["test"],
    "index_path": "bm25s_indexes/bm25/",
    "corpus_path": "dataset/wiki-18.jsonl",
    "model2path": {"e5": "models/e5-base-v2", "llama3-8B-instruct": "models/llama-3-8b"},
    "generator_model": "llama3-8B-instruct",
    "retrieval_method": "bm25",
    "metrics": ["em", "f1", "acc", "precision", "recall", "input_tokens"],
    "retrieval_topk": 1,
    "save_intermediate_data": True,
}

config = Config(config_dict=config_dict)
all_split = get_dataset(config)
test_data = all_split["test"]

prompt_template = PromptTemplate(
    config,
    system_prompt=(
        "Answer the question based on the given document. "
        "Only give me the answer and do not output any other words. "
        "\nThe following are given documents.\n\n{reference}"
    ),
    user_prompt="Question: {question}\nAnswer:",
)

pipeline = SequentialPipeline(config, prompt_template=prompt_template)
output_dataset = pipeline.run(test_data, do_eval=True)
print("---generation output by bm25---")
And the error log is:
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|█████████▌ | 1/4 [01:06<03:19, 66.41s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████| 4/4 [03:35<00:00, 53.87s/it]
Generation process: 0%| | 0/903 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
Generation process: 0%| | 1/903 [00:56<14:14:49, 56.86s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
[19:18<00:01, 1.55s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generation process: 100%|█████████████████████████████████████████| 903/903 [19:28<00:00, 1.29s/it]
Traceback (most recent call last):
  File "/scratch/yanxinyu/FlashRAG/examples/quick_start/bm25s_pipline.py", line 42, in <module>
    output_dataset = pipeline.run(test_data, do_eval=True)
  File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 130, in run
    dataset = self.evaluate(dataset, do_eval=do_eval, pred_process_fun=pred_process_fun)
  File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 37, in evaluate
    eval_result = self.evaluator.evaluate(dataset)
  File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 66, in evaluate
  File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 82, in save_data
  File "/scratch/FlashRAG/flashrag/dataset/dataset.py", line 160, in save
    json.dump(save_data, f, indent=4)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 429, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ndarray is not JSON serializable
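The TypeError above is raised when `json.dump` meets a NumPy array somewhere in the data being saved. One possible workaround, sketched here under assumptions, is to pass a `default` fallback that converts array-like values to plain Python types. The helper names `to_jsonable` and `dumps_with_arrays` are hypothetical and not part of FlashRAG, and the log does not show which field actually holds the array.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dump/json.dumps: convert array-like objects.

    NumPy arrays and NumPy scalars both expose .tolist(), so duck typing
    covers them without importing numpy here. (Hypothetical helper.)
    """
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # Mirror the standard error for anything we do not know how to convert.
    raise TypeError(
        f"Object of type {obj.__class__.__name__} is not JSON serializable"
    )


def dumps_with_arrays(data, **kwargs):
    """Serialize data, routing unsupported values through to_jsonable."""
    return json.dumps(data, default=to_jsonable, **kwargs)
```

With this in place, a call such as `json.dump(save_data, f, indent=4, default=to_jsonable)` would write arrays as nested lists. Converting the arrays to lists at the point where they are produced would work just as well and keeps the saved data free of NumPy types.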
And the result in intermediate_data.json is:
~/FlashRAG/examples/quick_start/output/nq_2024_11_18_23_22_experiment$ cat intermediate_data.json
"id": "test_0",
"question": "who got the first nobel prize in physics",
"golden_answers": [
"Wilhelm Conrad R\u00f6ntgen"
"output": {
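The file above breaks off partway through the first record, which is consistent with `json.dump` writing its output chunk by chunk and raising midway (the traceback shows it iterating `for chunk in iterable`). A minimal sketch of a save routine that serializes to a string first, so that a serialization error leaves no truncated file behind, could look like this. `save_json` is a hypothetical helper, not FlashRAG's own save method.

```python
import json


def save_json(data, path, **dump_kwargs):
    """Write data as JSON only if the whole object serializes cleanly.

    json.dumps raises before the file is opened, so a failure here does
    not leave a half-written file on disk. (Hypothetical helper.)
    """
    text = json.dumps(data, **dump_kwargs)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

The trade-off is memory: the full JSON text is held in memory before it is written, which may matter for very large result sets.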
I might be overlooking something fundamental, but I’ve been struggling with this issue.
Any assistance would be helpful. Thanks!