Hi,

I encountered a strange problem while reproducing the baseline with BM25 as the retrieval method.

First, I used wiki-18.jsonl, downloaded from your Hugging Face dataset, as the full document collection. Then I built retrieval indexes with both the BM25 and E5-base-v2 methods; the problem occurs only with the BM25 one, which uses bm25s as the backend. The files generated by indexing live under ~/FlashRAG/examples/quick_start/bm25s_indexes/bm25/.
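For context, the indexing step boils down to the following (a minimal sketch of the bm25s flow rather than my exact script; it assumes the standard FlashRAG corpus layout with a `contents` field per line):

```python
import json

import bm25s

# Load the wiki-18 corpus: one JSON object per line with a "contents" field.
with open("dataset/wiki-18.jsonl", "r", encoding="utf-8") as f:
    corpus = [json.loads(line)["contents"] for line in f]

# Tokenize, build the BM25 index, and save it where index_path points.
corpus_tokens = bm25s.tokenize(corpus, stopwords="en")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
retriever.save("bm25s_indexes/bm25/")
```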
After indexing, I ran the benchmark below to test the result, with Llama-3-8B-Instruct as the generator model:
```python
import argparse

from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline
from flashrag.prompt import PromptTemplate

# Parsed but unused in this run; model/retriever paths are set via model2path below.
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str)
parser.add_argument("--retriever_path", type=str)
args = parser.parse_args()

config_dict = {
    "data_dir": "dataset/FlashRAG_datasets/",
    "dataset_name": "nq",
    "split": ["test"],
    "index_path": "bm25s_indexes/bm25/",
    "corpus_path": "dataset/wiki-18.jsonl",
    "model2path": {"e5": "models/e5-base-v2", "llama3-8B-instruct": "models/llama-3-8b"},
    "generator_model": "llama3-8B-instruct",
    "retrieval_method": "bm25",
    "metrics": ["em", "f1", "acc", "precision", "recall", "input_tokens"],
    "retrieval_topk": 1,
    "save_intermediate_data": True,
}

config = Config(config_dict=config_dict)
all_split = get_dataset(config)
test_data = all_split["test"]

prompt_template = PromptTemplate(
    config,
    system_prompt="Answer the question based on the given document. "
    "Only give me the answer and do not output any other words. "
    "\nThe following are given documents.\n\n{reference}",
    user_prompt="Question: {question}\nAnswer:",
)

pipeline = SequentialPipeline(config, prompt_template=prompt_template)
output_dataset = pipeline.run(test_data, do_eval=True)
print("---generation output by bm25---")
print(output_dataset.pred)
```
And the error log is:

```
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|█████████▌ | 1/4 [01:06<03:19, 66.41s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████| 4/4 [03:35<00:00, 53.87s/it]
Generation process: 0%| | 0/903 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
Generation process: 0%| | 1/903 [00:56<14:14:49, 56.86s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
.......
.......
.......
[19:18<00:01, 1.55s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generation process: 100%|█████████████████████████████████████████| 903/903 [19:28<00:00, 1.29s/it]
```
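As an aside, the repeated `pad_token_id` warning above seems unrelated to the crash; if I understand the transformers behavior correctly, Llama 3 ships without a pad token, so the warning can be silenced by setting one explicitly (generic transformers sketch, not FlashRAG-specific; the model path is a placeholder from my config):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-3-8b"  # placeholder; same path as in model2path above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Llama 3 has no dedicated pad token; reusing EOS keeps generate() from warning.
tokenizer.pad_token_id = tokenizer.eos_token_id
model.generation_config.pad_token_id = tokenizer.eos_token_id
```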
The run then crashes at the end of evaluation, while saving the results:

```
Traceback (most recent call last):
File "/scratch/yanxinyu/FlashRAG/examples/quick_start/bm25s_pipline.py", line 42, in <module>
output_dataset = pipeline.run(test_data, do_eval=True)
File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 130, in run
dataset = self.evaluate(dataset, do_eval=do_eval, pred_process_fun=pred_process_fun)
File "/scratch/FlashRAG/flashrag/pipeline/pipeline.py", line 37, in evaluate
eval_result = self.evaluator.evaluate(dataset)
File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 66, in evaluate
self.save_data(data)
File "/scratch/FlashRAG/flashrag/evaluator/evaluator.py", line 82, in save_data
data.save(save_path)
File "/scratch/FlashRAG/flashrag/dataset/dataset.py", line 160, in save
json.dump(save_data, f, indent=4)
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/__init__.py", line 179, in dump
for chunk in iterable:
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 429, in _iterencode
yield from _iterencode_list(o, _current_indent_level)
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/scratch/anaconda3/envs/FlashRAG/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ndarray is not JSON serializable
```
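If it helps with triage: my guess is that the retrieval scores coming back from the bm25s backend are numpy arrays (or numpy scalars), end up in the intermediate data, and the standard-library json encoder cannot serialize them. As a generic workaround sketch (the usual numpy-to-JSON pattern, not FlashRAG's API), a custom encoder that downcasts numpy types would make the dump succeed:

```python
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """json.JSONEncoder that falls back to plain Python types for numpy objects."""

    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()  # arrays -> (nested) lists
        if isinstance(o, np.generic):
            return o.item()    # np.float32, np.int64, ... -> float/int
        return super().default(o)


# e.g. in flashrag/dataset/dataset.py, line 160:
# json.dump(save_data, f, indent=4, cls=NumpyEncoder)
```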
And the result in intermediate_data.json is:

```
~/FlashRAG/examples/quick_start/output/nq_2024_11_18_23_22_experiment$ cat intermediate_data.json
[
    {
        "id": "test_0",
        "question": "who got the first nobel prize in physics",
        "golden_answers": [
            "Wilhelm Conrad R\u00f6ntgen"
        ],
        "output": {
...
```
I might be overlooking something fundamental, but I’ve been struggling with this issue.
Any assistance would be helpful. Thanks!