Marker-Inc-Korea / AutoRAG

AutoML tool for RAG
https://auto-rag.com/
Apache License 2.0

[BUG] Running evaluate on generated qa set gives error that retrieval_gt doc_id not in corpus, but it is in the corpus #630

Closed dividor closed 2 months ago

dividor commented 2 months ago

Describe the bug: I generated a QA set ...

qa_df = pd.read_parquet(QA_FILE)
display(qa_df)
qid | query | retrieval_gt | generation_gt
-- | -- | -- | --
601d49ed-df3a-4b4b-a889-c4daaf4fc589 | What key insights and lessons are available to improve school-based programming for enhanced school health and nutrition outcomes??? | [[**e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639**, e9208f70-c4b2-4d18-b942-58b59cb3831f, 78fa60b3-fe56-4d74-9cc2-c1fedd5fd4f5]]

Note the retrieval_gt doc_id e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639.

Checking the corpus ...

corpus_df = pd.read_parquet(CORPUS_FILE)
display(corpus_df[corpus_df['doc_id'] == 'e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639'])
doc_id | contents | metadata
-- | -- | --
**e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639** | and healthy practices. There is also an increase in (a) demonstration of new teaching techniques \nand tools by the teachers (b) attentiveness and attendance of the students (c) discourse on \nimproving quality of education. With regard to health and dietary practices, there is an increase \nin the dietary diversity score in intervention schools from baseline (4.15) to midterm (5.49) due \nto an increased demand from st udents for nutritious food as a result of awareness generation \ninitiatives of the SFP. \n19 However, an unintended negative impact was found as a result of the programme wherein, the \nabsence of a proper waste disposal mechanism for the plastic wrappers of biscuits, resulted in \nschools burning the wrappers in open instead of disposin g them responsibly. This poses a major \nenvironmental concern and requires immediate correction . \n20 Consi dering that there is a growing realization on the importance of education,

It exists. However, when I run the evaluation with ...

PROJECT_DIR = f"{DATA_DIR}/autorag"
AUTORAG_CONFIG='./autorag_openai.yaml'

if not os.path.exists(PROJECT_DIR):
    os.makedirs(PROJECT_DIR)

evaluator = Evaluator(qa_data_path=QA_FILE, corpus_data_path=CORPUS_FILE,
                      project_dir=PROJECT_DIR)

evaluator.start_trial(AUTORAG_CONFIG)

I get an error saying this doc_id doesn't exist ...

ValueError: doc_id: e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639 not found in corpus_data.

I have restarted the kernel so that everything is read fresh from the file system, as shown above, and I still get the same error.
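The lookup above confirms a single doc_id. A minimal sketch of a broader check follows, which collects every ground-truth ID that has no exact match in the corpus. The helper name `missing_gt_ids` and the toy IDs are hypothetical. It assumes `retrieval_gt` is a list of ID groups per row, as in the table above, and it uses plain Python containers so it runs without any dependencies.

```python
from typing import Iterable, List, Set


def missing_gt_ids(retrieval_gt: Iterable[Iterable[Iterable[str]]],
                   corpus_ids: Iterable[str]) -> List[str]:
    """Return every ground-truth doc_id with no exact match in the corpus."""
    known: Set[str] = set(corpus_ids)
    missing: List[str] = []
    for groups in retrieval_gt:        # one entry per QA row
        for group in groups:           # each row holds a list of ID groups
            for doc_id in group:
                # Exact string comparison, first-seen order, no duplicates.
                if doc_id not in known and doc_id not in missing:
                    missing.append(doc_id)
    return missing


# Toy data shaped like the tables above (IDs shortened, hypothetical).
corpus_ids = ["e18c6855", "e9208f70", "78fa60b3"]
retrieval_gt = [[["e18c6855", "e9208f70", "78fa60b3"]], [["deadbeef"]]]
print(missing_gt_ids(retrieval_gt, corpus_ids))  # ['deadbeef']
```

With the DataFrames loaded above, the same helper could be called as `missing_gt_ids(qa_df['retrieval_gt'].tolist(), corpus_df['doc_id'])`. An empty result would suggest that the files on disk are consistent and that the mismatch comes from somewhere else.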

To Reproduce: See above.

Expected behavior: Should work. :)

Full Error log

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 10
      5     os.makedirs(PROJECT_DIR)
      7 evaluator = Evaluator(qa_data_path=QA_FILE, corpus_data_path=CORPUS_FILE,
      8                       project_dir=PROJECT_DIR)
---> 10 evaluator.start_trial(AUTORAG_CONFIG)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/evaluator.py:98, in Evaluator.start_trial(self, yaml_path)
     96         previous_result = self.qa_data
     97     logger.info(f'Running node line {node_line_name}...')
---> 98     previous_result = run_node_line(node_line, node_line_dir, previous_result)
    100     trial_summary_df = self._append_node_line_summary(node_line_name, node_line_dir, trial_summary_df)
    102 trial_summary_df.to_csv(os.path.join(self.project_dir, trial_name, 'summary.csv'), index=False)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/node_line.py:45, in run_node_line(nodes, node_line_dir, previous_result)
     43 summary_lst = []
     44 for node in nodes:
---> 45     previous_result = node.run(previous_result, node_line_dir)
     46     node_summary_df = load_summary_file(os.path.join(node_line_dir, node.node_type, 'summary.csv'))
     47     best_node_row = node_summary_df.loc[node_summary_df['is_best']]

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/schema/node.py:57, in Node.run(self, previous_result, node_line_dir)
     55 logger.info(f'Running node {self.node_type}...')
     56 input_modules, input_params = self.get_param_combinations()
---> 57 return self.run_node(modules=input_modules,
     58                      module_params=input_params,
     59                      previous_result=previous_result,
     60                      node_line_dir=node_line_dir,
     61                      strategies=self.strategy)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/nodes/retrieval/run.py:121, in run_retrieval_node(modules, module_params, previous_result, node_line_dir, strategies)
    118 if any([module.__name__ in semantic_module_names for module in modules]):
    119     semantic_modules, semantic_module_params = zip(*filter(lambda x: x[0].__name__ in semantic_module_names,
    120                                                            zip(modules, module_params)))
--> 121     semantic_results, semantic_times = run(semantic_modules, semantic_module_params)
    122     semantic_summary_df = save_and_summary(semantic_modules, semantic_module_params,
    123                                            semantic_results, semantic_times, filename_first)
    124     semantic_selected_result, semantic_selected_filename = find_best(semantic_results, semantic_times,
    125                                                                      semantic_summary_df['filename'].tolist())

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/nodes/retrieval/run.py:62, in run_retrieval_node.<locals>.run(input_modules, input_module_params)
     53 def run(input_modules, input_module_params) -> Tuple[List[pd.DataFrame], List]:
     54     """
     55         Run input modules and parameters.
     56
   (...)
     60         Second, it returns list of execution times.
     61     """
---> 62     result, execution_times = zip(*map(lambda task: measure_speed(
     63         task[0], project_dir=project_dir, previous_result=previous_result, **task[1]),
     64                                        zip(input_modules, input_module_params)))
     65     average_times = list(map(lambda x: x / len(result[0]), execution_times))
     67     # run metrics before filtering

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/nodes/retrieval/run.py:62, in run_retrieval_node.<locals>.run.<locals>.<lambda>(task)
     53 def run(input_modules, input_module_params) -> Tuple[List[pd.DataFrame], List]:
     54     """
     55         Run input modules and parameters.
     56
   (...)
     60         Second, it returns list of execution times.
     61     """
---> 62     result, execution_times = zip(*map(lambda task: measure_speed(
     63         task[0], project_dir=project_dir, previous_result=previous_result, **task[1]),
     64                                        zip(input_modules, input_module_params)))
     65     average_times = list(map(lambda x: x / len(result[0]), execution_times))
     67     # run metrics before filtering

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/strategy.py:14, in measure_speed(func, *args, **kwargs)
     10 """
     11 Method for measuring execution speed of the function.
     12 """
     13 start_time = time.time()
---> 14 result = func(*args, **kwargs)
     15 end_time = time.time()
     16 return result, end_time - start_time

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:56, in result_to_dataframe.<locals>.decorator_result_to_dataframe.<locals>.wrapper(*args, **kwargs)
     54 @functools.wraps(func)
     55 def wrapper(*args, **kwargs) -> pd.DataFrame:
---> 56     results = func(*args, **kwargs)
     57     if len(column_names) == 1:
     58         df_input = {column_names[0]: results}

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/nodes/retrieval/base.py:97, in retrieval_node.<locals>.wrapper(project_dir, previous_result, **kwargs)
     95 # fetch data from corpus_data
     96 corpus_data = pd.read_parquet(os.path.join(data_dir, "corpus.parquet"), engine='pyarrow')
---> 97 contents = fetch_contents(corpus_data, ids)
     99 return contents, ids, scores

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:30, in fetch_contents(corpus_data, ids, column_name)
     27 def fetch_contents_pure(ids: List[str], corpus_data: pd.DataFrame, column_name: str):
     28     return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))
---> 30 result = flatten_apply(fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name)
     31 return result

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:335, in flatten_apply(func, nested_list, **kwargs)
    333 df = pd.DataFrame({'col1': nested_list})
    334 df = df.explode('col1')
--> 335 df['result'] = func(df['col1'].tolist(), **kwargs)
    336 return df.groupby(level=0, sort=False)['result'].apply(list).tolist()

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:28, in fetch_contents.<locals>.fetch_contents_pure(ids, corpus_data, column_name)
     27 def fetch_contents_pure(ids: List[str], corpus_data: pd.DataFrame, column_name: str):
---> 28     return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:28, in fetch_contents.<locals>.fetch_contents_pure.<locals>.<lambda>(x)
     27 def fetch_contents_pure(ids: List[str], corpus_data: pd.DataFrame, column_name: str):
---> 28     return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/utils/util.py:41, in fetch_one_content(corpus_data, id_, column_name)
     39 fetch_result = corpus_data[corpus_data['doc_id'] == id_]
     40 if fetch_result.empty:
---> 41     raise ValueError(f"doc_id: {id_} not found in corpus_data.")
     42 else:
     43     return fetch_result[column_name].iloc[0]

ValueError: doc_id: e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639 not found in corpus_data.

Code where the bug happened: See above.

Additional context: Thanks!

dividor commented 2 months ago

Ahh. It seems that if the evaluation doesn't complete, I need to remove the project directory and restart the kernel. It seems to be working now.

vkehfdl1 commented 2 months ago

@dividor That's right. You have to create a new project folder whenever the data has changed.

Closing this issue since this is not a bug.
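One way to follow this advice without deleting folders by hand is to derive the project directory name from the contents of the input files, so that changed data always lands in a fresh folder. The following is only a sketch of that idea, not part of AutoRAG. The helper names `data_fingerprint` and `project_dir_for` and the `autorag_` prefix are hypothetical.

```python
import hashlib
from pathlib import Path


def data_fingerprint(*paths: Path) -> str:
    """Short hash over the contents of the given files, in the given order."""
    outer = hashlib.sha256()
    for path in paths:
        inner = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large parquet files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                inner.update(chunk)
        outer.update(inner.digest())
    return outer.hexdigest()[:12]


def project_dir_for(base_dir: Path, qa_file: Path, corpus_file: Path) -> Path:
    """Create (if needed) and return a project folder tied to this exact data."""
    project_dir = base_dir / f"autorag_{data_fingerprint(qa_file, corpus_file)}"
    project_dir.mkdir(parents=True, exist_ok=True)
    return project_dir
```

The returned path could then be passed as `project_dir` to `Evaluator(...)` as in the snippet earlier in this thread. A rerun on unchanged files reuses the same folder, while regenerated QA or corpus files get a new one.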

tha23rd commented 1 week ago

Hi - happy to create a new issue, but I'm running into the same issue. I've deleted the autorag folder and have tried regenerating things multiple times, and I still get the:

ValueError: doc_id: 1240b6d7-c82b-4367-bbff-753fa6e50cbc not found in corpus_data.

I have triple-checked that this doc_id exists in both qa.parquet and corpus.parquet, both in the files created by the QA process and in the data folder that autorag creates when evaluating. I've deleted all other parquet files in my project, so I'm genuinely not sure what's left to check here.

Running via:

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet', project_dir='./rag_opt_project_dir')
evaluator.start_trial('./rag_opt_config.yaml')

Again, I have made sure to delete rag_opt_project_dir before running this code. Thanks!

cc @vkehfdl1

vkehfdl1 commented 1 week ago

Hello @tha23rd! So you deleted the whole project directory and double-checked that the doc_id exists in both the qa and corpus DataFrames. Hmm... that is weird. Can you send me the full stack trace? I'd love to help.

(Also, be sure to delete every project directory. If you ran it without setting a project directory, you might find a "data" or "resources" folder in your root directory. Deleting those might fix your problem.)
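A small sketch for spotting such leftover folders before a rerun is shown below. The folder names "data" and "resources" come from the comment above. The helper name `find_leftover_dirs` is hypothetical. The function only lists candidates and deletes nothing, since a folder called "data" may well hold unrelated files.

```python
from pathlib import Path
from typing import Iterable, List


def find_leftover_dirs(root: Path,
                       names: Iterable[str] = ("data", "resources")) -> List[Path]:
    """List directories directly under root whose name is in names."""
    return sorted(root / name for name in names if (root / name).is_dir())


# Inspect the current working directory; review the output before removing anything.
for leftover in find_leftover_dirs(Path.cwd()):
    print(f"possible stale folder: {leftover}")
```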

tha23rd commented 1 week ago

Here's a stack trace (in the meantime I will double-check the project directory):

[10/25/24 07:44:47] INFO     [node.py:55] >> Running node passage_augmenter...                                                                                                                                                                                          node.py:55
                    INFO     [base.py:21] >> Initialize passage augmenter node - PassPassageAugmenter module...                                                                                                                                                         base.py:21
                    INFO     [base.py:38] >> Running passage augmenter node - PassPassageAugmenter module...                                                                                                                                                            base.py:38
                    INFO     [base.py:33] >> Initialize passage augmenter node - PassPassageAugmenter module...                                                                                                                                                         base.py:33
                    INFO     [base.py:21] >> Initialize passage augmenter node - PrevNextPassageAugmenter module...                                                                                                                                                     base.py:21
                    INFO     [base.py:38] >> Running passage augmenter node - PrevNextPassageAugmenter module...                                                                                                                                                        base.py:38
                    ERROR    [__init__.py:60] >> Unexpected exception                                                                                                                                                                                               __init__.py:60
                             ╭──────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────────────╮               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/gen_qa.py:75 in <module>                                                                                                                                                               │               
                             │                                                                                                                                                                                                                                    │               
                             │   72 from autorag.evaluator import Evaluator                                                                                                                                                                                       │               
                             │   73                                                                                                                                                                                                                               │               
                             │   74 evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet',                                                                                                                                       │               
                             │      project_dir='./rag_opt_project_dir2')                                                                                                                                                                                         │               
                             │ ❱ 75 evaluator.start_trial('./rag_opt_config.yaml')                                                                                                                                                                                │               
                             │   76                                                                                                                                                                                                                               │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/evaluator.py:137 in start_trial                                                                                                             │               
                             │                                                                                                                                                                                                                                    │               
                             │   134 │   │   │   validator = Validator(                                                                                                                                                                                           │               
                             │   135 │   │   │   │   qa_data_path=self.qa_data_path, corpus_data_path=self.corpus_data_path                                                                                                                                       │               
                             │   136 │   │   │   )                                                                                                                                                                                                                │               
                             │ ❱ 137 │   │   │   validator.validate(yaml_path)                                                                                                                                                                                    │               
                             │   138 │   │                                                                                                                                                                                                                        │               
                             │   139 │   │   os.environ["PROJECT_DIR"] = self.project_dir                                                                                                                                                                         │               
                             │   140                                                                                                                                                                                                                              │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/validator.py:80 in validate                                                                                                                 │               
                             │                                                                                                                                                                                                                                    │               
                             │   77 │   │   │   │   corpus_data_path=corpus_path.name,                                                                                                                                                                            │               
                             │   78 │   │   │   │   project_dir=temp_project_dir,                                                                                                                                                                                 │               
                             │   79 │   │   │   )                                                                                                                                                                                                                 │               
                             │ ❱ 80 │   │   │   evaluator.start_trial(yaml_path, skip_validation=True)                                                                                                                                                            │               
                             │   81 │   │   │   qa_path.close()                                                                                                                                                                                                   │               
                             │   82 │   │   │   corpus_path.close()                                                                                                                                                                                               │               
                             │   83 │   │   │   os.unlink(qa_path.name)                                                                                                                                                                                           │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/evaluator.py:188 in start_trial                                                                                                             │               
                             │                                                                                                                                                                                                                                    │               
                             │   185 │   │   │   if i == 0:                                                                                                                                                                                                       │               
                             │   186 │   │   │   │   previous_result = self.qa_data                                                                                                                                                                               │               
                             │   187 │   │   │   logger.info(f"Running node line {node_line_name}...")                                                                                                                                                            │               
                             │ ❱ 188 │   │   │   previous_result = run_node_line(node_line, node_line_dir, previous_result)                                                                                                                                       │               
                             │   189 │   │   │                                                                                                                                                                                                                    │               
                             │   190 │   │   │   trial_summary_df = self._append_node_line_summary(                                                                                                                                                               │               
                             │   191 │   │   │   │   node_line_name, node_line_dir, trial_summary_df                                                                                                                                                              │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/node_line.py:47 in run_node_line                                                                                                            │               
                             │                                                                                                                                                                                                                                    │               
                             │   44 │                                                                                                                                                                                                                             │               
                             │   45 │   summary_lst = []                                                                                                                                                                                                          │               
                             │   46 │   for node in nodes:                                                                                                                                                                                                        │               
                             │ ❱ 47 │   │   previous_result = node.run(previous_result, node_line_dir)                                                                                                                                                            │               
                             │   48 │   │   node_summary_df = load_summary_file(                                                                                                                                                                                  │               
                             │   49 │   │   │   os.path.join(node_line_dir, node.node_type, "summary.csv")                                                                                                                                                        │               
                             │   50 │   │   )                                                                                                                                                                                                                     │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/schema/node.py:57 in run                                                                                                                    │               
                             │                                                                                                                                                                                                                                    │               
                             │    54 │   def run(self, previous_result: pd.DataFrame, node_line_dir: str) -> pd.DataFrame:                                                                                                                                        │               
                             │    55 │   │   logger.info(f"Running node {self.node_type}...")                                                                                                                                                                     │               
                             │    56 │   │   input_modules, input_params = self.get_param_combinations()                                                                                                                                                          │               
                             │ ❱  57 │   │   return self.run_node(                                                                                                                                                                                                │               
                             │    58 │   │   │   modules=input_modules,                                                                                                                                                                                           │               
                             │    59 │   │   │   module_params=input_params,                                                                                                                                                                                      │               
                             │    60 │   │   │   previous_result=previous_result,                                                                                                                                                                                 │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/run.py:32 in run_passage_augmenter_node                                                                              │               
                             │                                                                                                                                                                                                                                    │               
                             │    29 │   retrieval_gt = qa_df["retrieval_gt"].tolist()                                                                                                                                                                            │               
                             │    30 │   retrieval_gt = apply_recursive(lambda x: str(x), to_list(retrieval_gt))                                                                                                                                                  │               
                             │    31 │                                                                                                                                                                                                                            │               
                             │ ❱  32 │   results, execution_times = zip(                                                                                                                                                                                          │               
                             │    33 │   │   *map(                                                                                                                                                                                                                │               
                             │    34 │   │   │   lambda task: measure_speed(                                                                                                                                                                                      │               
                             │    35 │   │   │   │   task[0].run_evaluator,                                                                                                                                                                                       │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/run.py:34 in <lambda>                                                                                                │               
                             │                                                                                                                                                                                                                                    │               
                             │    31 │                                                                                                                                                                                                                            │               
                             │    32 │   results, execution_times = zip(                                                                                                                                                                                          │               
                             │    33 │   │   *map(                                                                                                                                                                                                                │               
                             │ ❱  34 │   │   │   lambda task: measure_speed(                                                                                                                                                                                      │               
                             │    35 │   │   │   │   task[0].run_evaluator,                                                                                                                                                                                       │               
                             │    36 │   │   │   │   project_dir=project_dir,                                                                                                                                                                                     │               
                             │    37 │   │   │   │   previous_result=previous_result,                                                                                                                                                                             │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/strategy.py:14 in measure_speed                                                                                                             │               
                             │                                                                                                                                                                                                                                    │               
                             │    11 │   Method for measuring execution speed of the function.                                                                                                                                                                    │               
                             │    12 │   """                                                                                                                                                                                                                      │               
                             │    13 │   start_time = time.time()                                                                                                                                                                                                 │               
                             │ ❱  14 │   result = func(*args, **kwargs)                                                                                                                                                                                           │               
                             │    15 │   end_time = time.time()                                                                                                                                                                                                   │               
                             │    16 │   return result, end_time - start_time                                                                                                                                                                                     │               
                             │    17                                                                                                                                                                                                                              │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/schema/base.py:26 in run_evaluator                                                                                                          │               
                             │                                                                                                                                                                                                                                    │               
                             │   23 │   │   **kwargs,                                                                                                                                                                                                             │               
                             │   24 │   ):                                                                                                                                                                                                                        │               
                             │   25 │   │   instance = cls(project_dir, *args, **kwargs)                                                                                                                                                                          │               
                             │ ❱ 26 │   │   result = instance.pure(previous_result, *args, **kwargs)                                                                                                                                                              │               
                             │   27 │   │   del instance                                                                                                                                                                                                          │               
                             │   28 │   │   return result                                                                                                                                                                                                         │               
                             │   29                                                                                                                                                                                                                               │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:69 in wrapper                                                                                                                 │               
                             │                                                                                                                                                                                                                                    │               
                             │    66 │   def decorator_result_to_dataframe(func: Callable):                                                                                                                                                                       │               
                             │    67 │   │   @functools.wraps(func)                                                                                                                                                                                               │               
                             │    68 │   │   def wrapper(*args, **kwargs) -> pd.DataFrame:                                                                                                                                                                        │               
                             │ ❱  69 │   │   │   results = func(*args, **kwargs)                                                                                                                                                                                  │               
                             │    70 │   │   │   if len(column_names) == 1:                                                                                                                                                                                       │               
                             │    71 │   │   │   │   df_input = {column_names[0]: results}                                                                                                                                                                        │               
                             │    72 │   │   │   else:                                                                                                                                                                                                            │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/prev_next_augmenter.py:69 in pure                                                                                    │               
                             │                                                                                                                                                                                                                                    │               
                             │    66 │   │   augmented_ids = self._pure(ids, num_passages, mode)                                                                                                                                                                  │               
                             │    67 │   │                                                                                                                                                                                                                        │               
                             │    68 │   │   # fetch contents from corpus to use augmented ids                                                                                                                                                                    │               
                             │ ❱  69 │   │   augmented_contents = fetch_contents(self.corpus_df, augmented_ids)                                                                                                                                                   │               
                             │    70 │   │                                                                                                                                                                                                                        │               
                             │    71 │   │   query_embeddings, contents_embeddings = embedding_query_content(                                                                                                                                                     │               
                             │    72 │   │   │   queries, augmented_contents, self.embedding_model, batch=128                                                                                                                                                     │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:37 in fetch_contents                                                                                                          │               
                             │                                                                                                                                                                                                                                    │               
                             │    34 │   ):                                                                                                                                                                                                                       │               
                             │    35 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))                                                                                                                                      │               
                             │    36 │                                                                                                                                                                                                                            │               
                             │ ❱  37 │   result = flatten_apply(                                                                                                                                                                                                  │               
                             │    38 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name                                                                                                                                           │               
                             │    39 │   )                                                                                                                                                                                                                        │               
                             │    40 │   return result                                                                                                                                                                                                            │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:370 in flatten_apply                                                                                                          │               
                             │                                                                                                                                                                                                                                    │               
                             │   367 │   """                                                                                                                                                                                                                      │               
                             │   368 │   df = pd.DataFrame({"col1": nested_list})                                                                                                                                                                                 │               
                             │   369 │   df = df.explode("col1")                                                                                                                                                                                                  │               
                             │ ❱ 370 │   df["result"] = func(df["col1"].tolist(), **kwargs)                                                                                                                                                                       │               
                             │   371 │   return df.groupby(level=0, sort=False)["result"].apply(list).tolist()                                                                                                                                                    │               
                             │   372                                                                                                                                                                                                                              │               
                             │   373                                                                                                                                                                                                                              │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:35 in fetch_contents_pure                                                                                                     │               
                             │                                                                                                                                                                                                                                    │               
                             │    32 │   def fetch_contents_pure(                                                                                                                                                                                                 │               
                             │    33 │   │   ids: List[str], corpus_data: pd.DataFrame, column_name: str                                                                                                                                                          │               
                             │    34 │   ):                                                                                                                                                                                                                       │               
                             │ ❱  35 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))                                                                                                                                      │               
                             │    36 │                                                                                                                                                                                                                            │               
                             │    37 │   result = flatten_apply(                                                                                                                                                                                                  │               
                             │    38 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name                                                                                                                                           │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:35 in <lambda>                                                                                                                │               
                             │                                                                                                                                                                                                                                    │               
                             │    32 │   def fetch_contents_pure(                                                                                                                                                                                                 │               
                             │    33 │   │   ids: List[str], corpus_data: pd.DataFrame, column_name: str                                                                                                                                                          │               
                             │    34 │   ):                                                                                                                                                                                                                       │               
                             │ ❱  35 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))                                                                                                                                      │               
                             │    36 │                                                                                                                                                                                                                            │               
                             │    37 │   result = flatten_apply(                                                                                                                                                                                                  │               
                             │    38 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name                                                                                                                                           │               
                             │                                                                                                                                                                                                                                    │               
                             │ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:54 in fetch_one_content                                                                                                       │               
                             │                                                                                                                                                                                                                                    │               
                             │    51 │   │   │   return None                                                                                                                                                                                                      │               
                             │    52 │   │   fetch_result = corpus_data[corpus_data[id_column_name] == id_]                                                                                                                                                       │               
                             │    53 │   │   if fetch_result.empty:                                                                                                                                                                                               │               
                             │ ❱  54 │   │   │   raise ValueError(f"doc_id: {id_} not found in corpus_data.")                                                                                                                                                     │               
                             │    55 │   │   else:                                                                                                                                                                                                                │               
                             │    56 │   │   │   return fetch_result[column_name].iloc[0]                                                                                                                                                                         │               
                             │    57 │   else:                                                                                                                                                                                                                    │               
                             ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯               
                             ValueError: doc_id: 1240b6d7-c82b-4367-bbff-753fa6e50cbc not found in corpus_data.   
tha23rd commented 1 week ago

And here's the code I used:

from autorag.parser import Parser

# step1: parse

filepaths = "./test.txt"
parser = Parser(filepaths, "./parse_project_dir")
parser.start_parsing("./parsing.yml")

# step 2: chunk

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet", project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk_config.yaml")

# step 3: qa gen

from llama_index.llms.openai import OpenAI

from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from autorag.data.qa.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)
from autorag.data.qa.query.llama_gen_query import factoid_query_gen, concept_completion_query_gen
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.schema import Raw, Corpus
import pandas as pd

initial_raw_df = pd.read_parquet("./parse_project_dir/0/0.parquet")
initial_raw = Raw(initial_raw_df)

llm = OpenAI(model='gpt-4o-mini', api_key='')

corpus_df = pd.read_parquet("./chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, initial_raw)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=7)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        factoid_query_gen,  # query generation
        llm=llm,
    )
    .batch_apply(
        make_basic_gen_gt,  # answer generation (basic)
        llm=llm,
    )
    # .batch_apply(
    #     make_concise_gen_gt,  # answer generation (concise)
    #     llm=llm,
    # )
    # .batch_apply(
    #     concept_completion_query_gen,
    #     llm=llm,
    # )
    .filter(
        dontknow_filter_rule_based,  # filter don't know
        lang="en",
        # llm=llm,
    )
)

initial_qa.to_parquet('./qa.parquet', './corpus.parquet')

# step 4: rag optimize

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet', project_dir='./rag_opt_project_dir2')
evaluator.start_trial('./rag_opt_config.yaml')
vkehfdl1 commented 1 week ago

Hello @tha23rd, it looks certain that you are using a passage augmenter node in your YAML file. Unfortunately, validation of the passage augmenter is not working right now; see #863. When you start a trial, validation runs automatically, and that is what raises an error like ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.

There are two workarounds:

  1. Skip validation when starting the trial:
    autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --skip_validation true

or

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/project_directory', skip_validation=True)
evaluator.start_trial('your/path/to/config.yaml')
  2. Do not use the passage augmenter: delete the passage augmenter node from your config YAML file.
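For the second option, here is a minimal sketch of removing the node programmatically instead of editing the file by hand. It works on an already-parsed config (a plain dict, for example the result of loading the YAML). The `node_lines` / `nodes` / `node_type` layout, the `passage_augmenter` type string, and the module names in the example are assumptions for illustration, and `drop_node_type` is a hypothetical helper, not part of AutoRAG. Check your own config for the exact keys.

```python
from copy import deepcopy


def drop_node_type(config: dict, node_type: str) -> dict:
    """Return a copy of a parsed config with every node of `node_type` removed.

    Assumes the layout config["node_lines"][i]["nodes"][j]["node_type"].
    """
    cleaned = deepcopy(config)  # leave the caller's config untouched
    for node_line in cleaned.get("node_lines", []):
        node_line["nodes"] = [
            node
            for node in node_line.get("nodes", [])
            if node.get("node_type") != node_type
        ]
    return cleaned


# Hypothetical config, shaped like a parsed YAML file (all names are illustrative).
example_config = {
    "node_lines": [
        {
            "node_line_name": "retrieve_node_line",
            "nodes": [
                {"node_type": "retrieval", "modules": [{"module_type": "bm25"}]},
                {
                    "node_type": "passage_augmenter",
                    "modules": [{"module_type": "prev_next_augmenter"}],
                },
            ],
        }
    ]
}

cleaned_config = drop_node_type(example_config, "passage_augmenter")
remaining_types = [
    node["node_type"] for node in cleaned_config["node_lines"][0]["nodes"]
]
print(remaining_types)
```

The cleaned dict can then be written back to a new YAML file and passed to the trial in place of the original config.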

In the meantime, we will fix #863 as soon as possible. Thank you.
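To confirm that the QA and corpus files are themselves consistent, so that an error like the one above does not come from the data, a quick check along these lines may help. This is a sketch, not AutoRAG code: `find_missing_doc_ids` is a hypothetical helper, and it assumes `retrieval_gt` is a nested list of doc_ids per query, as shown in the QA table at the top of this issue. The ids below are toy values.

```python
from typing import Iterable, Set


def find_missing_doc_ids(
    retrieval_gt_rows: Iterable[Iterable[Iterable[str]]],
    corpus_doc_ids: Iterable[str],
) -> Set[str]:
    """Collect every doc_id referenced in retrieval_gt that is absent from the corpus."""
    known = set(corpus_doc_ids)
    missing = set()
    for row in retrieval_gt_rows:      # one entry per query
        for group in row:              # each query holds one or more groups of ids
            for doc_id in group:
                if doc_id not in known:
                    missing.add(doc_id)
    return missing


# Toy data; with real files one could pass qa_df["retrieval_gt"] and corpus_df["doc_id"].
toy_retrieval_gt = [
    [["doc-a", "doc-b"]],
    [["doc-c"], ["doc-x"]],
]
toy_corpus_ids = ["doc-a", "doc-b", "doc-c"]

missing_ids = find_missing_doc_ids(toy_retrieval_gt, toy_corpus_ids)
print(missing_ids)
```

If this returns an empty set for your real QA and corpus files, the data is consistent, and the workarounds above are the relevant fix.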