Closed: Martyniqo closed this issue 2 months ago
Hi @Martyniqo, this is great feedback! We will review your code and get back to you soon. Also, would you consider creating a pull request for your changes? We can review and merge your code accordingly. Thanks for your effort!
Hi,
Thank you for your response! I’ll create the pull request asap ☺️
Regards, Martyna
Hi @Martyniqo, are you running the example here with only the backbone LLM changed?
I ran the code below but couldn't reproduce the error.
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# initialize RAGResults from json/dict
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# set up the evaluator
evaluator = RAGChecker(
    extractor_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    checker_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    batch_size_extractor=32, batch_size_checker=32
)

evaluator.evaluate(rag_results, all_metrics)
print(rag_results)
Since the inputs to the functions in computation.py come from our upstream package RefChecker, the data types should be well controlled as long as the input data has the same format as the example here.
If you are running RAGChecker on your own data, could you provide some samples that lead to the error?
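For reference, here is a minimal sketch of the expected input shape as used in examples/checking_inputs.json. The field values below are placeholders and the structure is simplified, so please treat it as illustrative rather than the authoritative schema and compare against the example file in the repo:

import json
from ragchecker import RAGResults

# Illustrative only: a single entry in roughly the same shape as examples/checking_inputs.json.
sample = {
    "results": [
        {
            "query_id": "0",
            "query": "What is RAGChecker?",
            "gt_answer": "RAGChecker is a fine-grained evaluation framework for RAG systems.",
            "response": "RAGChecker evaluates RAG pipelines.",
            "retrieved_context": [
                {"doc_id": "doc_0", "text": "RAGChecker is an evaluation framework for RAG."}
            ],
        }
    ]
}
rag_results = RAGResults.from_json(json.dumps(sample))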
The bug has been fixed by modifying the output formats of the dependency RefChecker. Please install the latest version to avoid the error. Feel free to reopen the issue if you find something wrong with your data.
Problem: I encountered an issue in computation.py: the computation functions don't always handle different input data types properly. Specifically, when the input data is a one-dimensional array or a plain list, some functions fail with errors that prevent them from running correctly.
What happened: While running:

evaluator = RAGChecker(
    extractor_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    checker_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    batch_size_extractor=32, batch_size_checker=32
)
evaluator.evaluate(rag_results, all_metrics)

I received this error message while trying to compute retriever_metrics and generator_metrics:
Error during evaluation: object of type 'numpy.bool_' has no len()
This error suggests that the code tried to calculate the length of a boolean value, which shouldn't happen. It seems that the input data wasn't processed as expected, leading to this issue.
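For clarity, here is a standalone snippet (plain numpy, not RAGChecker's actual code) that reproduces the same TypeError when a single boolean ends up where a list/array of per-claim labels is expected:

import numpy as np

# Illustration only: calling len() on a single numpy boolean raises exactly the error reported above.
answer2response = np.bool_(True)  # a scalar instead of an array/list of per-claim labels
try:
    len(answer2response)
except TypeError as e:
    print(e)  # object of type 'numpy.bool_' has no len()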
What I changed: Added type checks: I updated several functions (such as compute_precision, compute_recall, and compute_retrieval) to include type checks. The code now verifies that the input is either a numpy array or a list before proceeding, which should prevent similar errors in the future.
Better handling of 1D arrays: In functions such as compute_retrieval and compute_context_utilization, I added a check for one-dimensional input. If the input is 1D, the code skips the axis-based reduction, which avoids errors from operations like np.max(..., axis=1); see the short snippet below.
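As a small illustration of the np.max problem (plain numpy with assumed shapes, not the actual RAGChecker data):

import numpy as np

# Illustration with assumed shapes: axis-based reduction works on a 2D
# claims x passages matrix but fails on a 1D array.
matrix_2d = np.array([[True, False], [False, False]])
print(np.max(matrix_2d, axis=1))  # per-row max: [ True False]

array_1d = np.array([True, False])
try:
    np.max(array_1d, axis=1)
except IndexError as e:  # numpy raises AxisError, a subclass of IndexError
    print(e)  # axis 1 is out of bounds for array of dimension 1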
Impact on computation accuracy: While these changes should make the code more robust, there is a chance they could affect the accuracy of the computations: if the input data doesn't match the expected format, the new conditions change which branch is taken and therefore how the metrics are computed. I've tested the changes, but I'm not entirely sure they don't alter any results in unintended ways. If you notice any issues, please advise.
Changes in code:
def evaluate_precision
if isinstance(answer2response, (np.ndarray, list)) and len(answer2response) > 0:
    result.metrics[metrics.precision] = np.mean(answer2response)
else:
    result.metrics[metrics.precision] = 0.
def evaluate_retrieval
if isinstance(retrieved2answer, (np.ndarray, list)) and len(retrieved2answer) > 0:
    if isinstance(retrieved2answer[0], (np.ndarray, list)) and len(retrieved2answer[0]) > 0:
        # 2D case: reduce over the passage axis for claim recall and over the claim axis for context precision
        claim_recalled = np.max(retrieved2answer, axis=1)
        result.metrics[metrics.claim_recall] = np.mean(claim_recalled)
        psg_useful = np.max(retrieved2answer, axis=0)
        result.metrics[metrics.context_precision] = np.mean(psg_useful)
    else:
        # 1D fallback: treat the input itself as the per-claim recall labels
        claim_recalled = retrieved2answer
        result.metrics[metrics.claim_recall] = np.mean(claim_recalled)
        result.metrics[metrics.context_precision] = 0.
else:
    result.metrics[metrics.claim_recall] = 0.
    result.metrics[metrics.context_precision] = 0.
def evaluate_context_utilization
if isinstance(retrieved2answer, (np.ndarray, list)) and len(retrieved2answer) > 0:
    if np.ndim(retrieved2answer) == 1 or (np.ndim(retrieved2answer) > 1 and len(retrieved2answer[0]) > 0):
        # use axis-based reduction only for 2D input; a 1D array already holds per-claim labels
        claim_recalled = np.max(retrieved2answer, axis=1) if np.ndim(retrieved2answer) > 1 else retrieved2answer
        if np.sum(claim_recalled) > 0:
            claim_used = claim_recalled & response2answer
            result.metrics[metrics.context_utilization] = np.sum(claim_used) / np.sum(claim_recalled)
        else:
            result.metrics[metrics.context_utilization] = 0.
    else:
        result.metrics[metrics.context_utilization] = 0.
else:
    result.metrics[metrics.context_utilization] = 0.
computation-v2.zip