dividor closed this issue 2 months ago
Ahh. It seems that if the evaluation doesn't complete, I need to remove the project directory and restart the kernel. It's working now.
@dividor That's right. You have to create a new project folder when the data has changed.
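For example, something along these lines before re-running the trial (a minimal sketch, assuming the project directory is ./rag_opt_project_dir):
import shutil
# Remove the stale project directory so the next trial rebuilds its
# internal copies of qa.parquet and corpus.parquet from scratch.
shutil.rmtree("./rag_opt_project_dir", ignore_errors=True)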
Closing this issue since it is not a bug.
Hi - happy to create a new issue, but I'm running into the same problem. I've deleted the autorag folder and have tried regenerating things multiple times, and I still get:
ValueError: doc_id: 1240b6d7-c82b-4367-bbff-753fa6e50cbc not found in corpus_data.
I have triple-checked that this doc_id exists in both qa.parquet and corpus.parquet, both in the files created by the QA process and in the data folder that AutoRAG creates when evaluating. I've deleted all other parquet files in my project, so I genuinely am not sure what's left to check here.
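For reference, this is the kind of check I ran (a quick pandas sketch, assuming the default doc_id and retrieval_gt column names):
import pandas as pd

doc_id = "1240b6d7-c82b-4367-bbff-753fa6e50cbc"
corpus_df = pd.read_parquet("./corpus.parquet")
qa_df = pd.read_parquet("./qa.parquet")

# The corpus is keyed by doc_id.
print("in corpus:", (corpus_df["doc_id"] == doc_id).any())
# retrieval_gt holds nested lists of doc_ids, so stringify before searching.
print("in qa retrieval_gt:", qa_df["retrieval_gt"].astype(str).str.contains(doc_id).any())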
Running via:
from autorag.evaluator import Evaluator
evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet', project_dir='./rag_opt_project_dir')
evaluator.start_trial('./rag_opt_config.yaml')
Again, I have made sure to delete rag_opt_project_dir before running this code. Thanks!
cc @vkehfdl1
Hello! @tha23rd So you deleted the whole project directory and double-checked that the doc_id exists in the qa and corpus DataFrames. Hmm... that is weird. Can you send me the full bug trace? I'd love to help you.
(Plus, be sure to delete all project directories. If you run AutoRAG without setting a project directory, you might find a "data" or "resources" folder in your root. Deleting those might fix your problem.)
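A quick way to spot such leftovers (a sketch, assuming you run it from your repository root):
import os
# AutoRAG can create these folders in the working directory when no
# project_dir is set; stale copies there can shadow your fresh data.
for leftover in ("data", "resources"):
    if os.path.isdir(leftover):
        print(f"leftover folder found: {os.path.abspath(leftover)}")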
Here's the stack trace (in the meantime I will double-check the project directory):
[10/25/24 07:44:47] INFO [node.py:55] >> Running node passage_augmenter... node.py:55
INFO [base.py:21] >> Initialize passage augmenter node - PassPassageAugmenter module... base.py:21
INFO [base.py:38] >> Running passage augmenter node - PassPassageAugmenter module... base.py:38
INFO [base.py:33] >> Initialize passage augmenter node - PassPassageAugmenter module... base.py:33
INFO [base.py:21] >> Initialize passage augmenter node - PrevNextPassageAugmenter module... base.py:21
INFO [base.py:38] >> Running passage augmenter node - PrevNextPassageAugmenter module... base.py:38
ERROR [__init__.py:60] >> Unexpected exception __init__.py:60
╭──────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/gen_qa.py:75 in <module> │
│ │
│ 72 from autorag.evaluator import Evaluator │
│ 73 │
│ 74 evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet', │
│ project_dir='./rag_opt_project_dir2') │
│ ❱ 75 evaluator.start_trial('./rag_opt_config.yaml') │
│ 76 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/evaluator.py:137 in start_trial │
│ │
│ 134 │ │ │ validator = Validator( │
│ 135 │ │ │ │ qa_data_path=self.qa_data_path, corpus_data_path=self.corpus_data_path │
│ 136 │ │ │ ) │
│ ❱ 137 │ │ │ validator.validate(yaml_path) │
│ 138 │ │ │
│ 139 │ │ os.environ["PROJECT_DIR"] = self.project_dir │
│ 140 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/validator.py:80 in validate │
│ │
│ 77 │ │ │ │ corpus_data_path=corpus_path.name, │
│ 78 │ │ │ │ project_dir=temp_project_dir, │
│ 79 │ │ │ ) │
│ ❱ 80 │ │ │ evaluator.start_trial(yaml_path, skip_validation=True) │
│ 81 │ │ │ qa_path.close() │
│ 82 │ │ │ corpus_path.close() │
│ 83 │ │ │ os.unlink(qa_path.name) │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/evaluator.py:188 in start_trial │
│ │
│ 185 │ │ │ if i == 0: │
│ 186 │ │ │ │ previous_result = self.qa_data │
│ 187 │ │ │ logger.info(f"Running node line {node_line_name}...") │
│ ❱ 188 │ │ │ previous_result = run_node_line(node_line, node_line_dir, previous_result) │
│ 189 │ │ │ │
│ 190 │ │ │ trial_summary_df = self._append_node_line_summary( │
│ 191 │ │ │ │ node_line_name, node_line_dir, trial_summary_df │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/node_line.py:47 in run_node_line │
│ │
│ 44 │ │
│ 45 │ summary_lst = [] │
│ 46 │ for node in nodes: │
│ ❱ 47 │ │ previous_result = node.run(previous_result, node_line_dir) │
│ 48 │ │ node_summary_df = load_summary_file( │
│ 49 │ │ │ os.path.join(node_line_dir, node.node_type, "summary.csv") │
│ 50 │ │ ) │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/schema/node.py:57 in run │
│ │
│ 54 │ def run(self, previous_result: pd.DataFrame, node_line_dir: str) -> pd.DataFrame: │
│ 55 │ │ logger.info(f"Running node {self.node_type}...") │
│ 56 │ │ input_modules, input_params = self.get_param_combinations() │
│ ❱ 57 │ │ return self.run_node( │
│ 58 │ │ │ modules=input_modules, │
│ 59 │ │ │ module_params=input_params, │
│ 60 │ │ │ previous_result=previous_result, │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/run.py:32 in run_passage_augmenter_node │
│ │
│ 29 │ retrieval_gt = qa_df["retrieval_gt"].tolist() │
│ 30 │ retrieval_gt = apply_recursive(lambda x: str(x), to_list(retrieval_gt)) │
│ 31 │ │
│ ❱ 32 │ results, execution_times = zip( │
│ 33 │ │ *map( │
│ 34 │ │ │ lambda task: measure_speed( │
│ 35 │ │ │ │ task[0].run_evaluator, │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/run.py:34 in <lambda> │
│ │
│ 31 │ │
│ 32 │ results, execution_times = zip( │
│ 33 │ │ *map( │
│ ❱ 34 │ │ │ lambda task: measure_speed( │
│ 35 │ │ │ │ task[0].run_evaluator, │
│ 36 │ │ │ │ project_dir=project_dir, │
│ 37 │ │ │ │ previous_result=previous_result, │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/strategy.py:14 in measure_speed │
│ │
│ 11 │ Method for measuring execution speed of the function. │
│ 12 │ """ │
│ 13 │ start_time = time.time() │
│ ❱ 14 │ result = func(*args, **kwargs) │
│ 15 │ end_time = time.time() │
│ 16 │ return result, end_time - start_time │
│ 17 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/schema/base.py:26 in run_evaluator │
│ │
│ 23 │ │ **kwargs, │
│ 24 │ ): │
│ 25 │ │ instance = cls(project_dir, *args, **kwargs) │
│ ❱ 26 │ │ result = instance.pure(previous_result, *args, **kwargs) │
│ 27 │ │ del instance │
│ 28 │ │ return result │
│ 29 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:69 in wrapper │
│ │
│ 66 │ def decorator_result_to_dataframe(func: Callable): │
│ 67 │ │ @functools.wraps(func) │
│ 68 │ │ def wrapper(*args, **kwargs) -> pd.DataFrame: │
│ ❱ 69 │ │ │ results = func(*args, **kwargs) │
│ 70 │ │ │ if len(column_names) == 1: │
│ 71 │ │ │ │ df_input = {column_names[0]: results} │
│ 72 │ │ │ else: │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/nodes/passageaugmenter/prev_next_augmenter.py:69 in pure │
│ │
│ 66 │ │ augmented_ids = self._pure(ids, num_passages, mode) │
│ 67 │ │ │
│ 68 │ │ # fetch contents from corpus to use augmented ids │
│ ❱ 69 │ │ augmented_contents = fetch_contents(self.corpus_df, augmented_ids) │
│ 70 │ │ │
│ 71 │ │ query_embeddings, contents_embeddings = embedding_query_content( │
│ 72 │ │ │ queries, augmented_contents, self.embedding_model, batch=128 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:37 in fetch_contents │
│ │
│ 34 │ ): │
│ 35 │ │ return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids)) │
│ 36 │ │
│ ❱ 37 │ result = flatten_apply( │
│ 38 │ │ fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name │
│ 39 │ ) │
│ 40 │ return result │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:370 in flatten_apply │
│ │
│ 367 │ """ │
│ 368 │ df = pd.DataFrame({"col1": nested_list}) │
│ 369 │ df = df.explode("col1") │
│ ❱ 370 │ df["result"] = func(df["col1"].tolist(), **kwargs) │
│ 371 │ return df.groupby(level=0, sort=False)["result"].apply(list).tolist() │
│ 372 │
│ 373 │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:35 in fetch_contents_pure │
│ │
│ 32 │ def fetch_contents_pure( │
│ 33 │ │ ids: List[str], corpus_data: pd.DataFrame, column_name: str │
│ 34 │ ): │
│ ❱ 35 │ │ return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids)) │
│ 36 │ │
│ 37 │ result = flatten_apply( │
│ 38 │ │ fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:35 in <lambda> │
│ │
│ 32 │ def fetch_contents_pure( │
│ 33 │ │ ids: List[str], corpus_data: pd.DataFrame, column_name: str │
│ 34 │ ): │
│ ❱ 35 │ │ return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids)) │
│ 36 │ │
│ 37 │ result = flatten_apply( │
│ 38 │ │ fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name │
│ │
│ /home/zach/dev/sql-agent/jaffle_shop_duckdb/.venv/lib/python3.12/site-packages/autorag/utils/util.py:54 in fetch_one_content │
│ │
│ 51 │ │ │ return None │
│ 52 │ │ fetch_result = corpus_data[corpus_data[id_column_name] == id_] │
│ 53 │ │ if fetch_result.empty: │
│ ❱ 54 │ │ │ raise ValueError(f"doc_id: {id_} not found in corpus_data.") │
│ 55 │ │ else: │
│ 56 │ │ │ return fetch_result[column_name].iloc[0] │
│ 57 │ else: │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: doc_id: 1240b6d7-c82b-4367-bbff-753fa6e50cbc not found in corpus_data.
And here's the code I used:
from autorag.parser import Parser
# step 1: parse
filepaths = "./test.txt"
parser = Parser(filepaths, "./parse_project_dir")
parser.start_parsing("./parsing.yml")
# step 2: chunk
from autorag.chunker import Chunker
chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet", project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk_config.yaml")
# step 3: qa gen
from llama_index.llms.openai import OpenAI
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based
from autorag.data.qa.generation_gt.llama_index_gen_gt import (
make_basic_gen_gt,
make_concise_gen_gt,
)
from autorag.data.qa.query.llama_gen_query import factoid_query_gen, concept_completion_query_gen
from autorag.data.qa.sample import random_single_hop
from autorag.data.qa.schema import Raw, Corpus
import pandas as pd
initial_raw_df = pd.read_parquet("./parse_project_dir/0/0.parquet")
initial_raw = Raw(initial_raw_df)
llm = OpenAI(model='gpt-4o-mini', api_key='')
corpus_df = pd.read_parquet("./chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, initial_raw)
initial_qa = (
    corpus_instance.sample(random_single_hop, n=7)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        factoid_query_gen,  # query generation
        llm=llm,
    )
    .batch_apply(
        make_basic_gen_gt,  # answer generation (basic)
        llm=llm,
    )
    # .batch_apply(
    #     make_concise_gen_gt,  # answer generation (concise)
    #     llm=llm,
    # )
    # .batch_apply(
    #     concept_completion_query_gen,
    #     llm=llm,
    # )
    .filter(
        dontknow_filter_rule_based,  # filter "don't know" answers
        lang="en",
        # llm=llm,
    )
)
initial_qa.to_parquet('./qa.parquet', './corpus.parquet')
# step 4: rag optimize
from autorag.evaluator import Evaluator
evaluator = Evaluator(qa_data_path='./qa.parquet', corpus_data_path='./corpus.parquet', project_dir='./rag_opt_project_dir2')
evaluator.start_trial('./rag_opt_config.yaml')
Hello @tha23rd
It is certain that you are using the passage_augmenter node in your YAML file!
Unfortunately, validation of the passage augmenter is not working right now. See #863.
But when you start a trial, it automatically runs validation, which raises an error like ValueError: doc_id: 0eec7e3a-e1c0-4d33-8cc5-7e604b30339b not found in corpus_data.
So there are two ways to work around it:
autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --skip_validation true
or
from autorag.evaluator import Evaluator
evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
project_dir='your/path/to/project_directory', skip_validation=True)
evaluator.start_trial('your/path/to/config.yaml')
In the meantime, we will fix #863 ASAP. Thank you.
Describe the bug: I generated a QA set ..
Note the retrieval_gt doc_id e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639.
Checking the corpus ...
It exists. However, when I run the evaluation with ...
I get an error saying this doc_id doesn't exist ...
ValueError: doc_id: e18c6855-f1c8-40ff-b9c5-ea1a0ab2f639 not found in corpus_data.
I have restarted the kernel so that it's reading everything from the file system, as shown above; same error.
To Reproduce: See above.
Expected behavior: Should work. :)
Full Error log
Code where the bug happened: See above.
Additional context: Thanks!