Ticket

Fixes: AAQ-675

Description

Goal

The goal of this PR is to allow retrying the LLM response up to N times (default is 0) when AlignScore fails because of a low score.
Changes
The following changes have been made:
- Added the `backoff` library as a dependency.
- Updated the endpoint to retry when the response is of type `QueryResponseError` and the error is a low alignment score. The previous failure reason can also be added to `response.debug_info["past_failure"]` (see the sketch below).
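A minimal sketch of the retry flow, with hypothetical stand-ins for the project's `QueryResponseError` shape and the LLM helper (the real implementation lives in the endpoint code):

```python
import backoff

ALIGN_SCORE_N_RETRIES = 1  # illustrative value; see the PR for the real default


class QueryResponseError:
    """Stand-in for the project's error response type (hypothetical shape)."""

    def __init__(self, error_message: str) -> None:
        self.error_message = error_message
        self.debug_info: dict = {}


def generate_llm_response(query: str):
    """Stand-in for the existing LLM + AlignScore pipeline."""
    return QueryResponseError("low alignment score")  # simulate a failure


def _failed_align_score(response) -> bool:
    """Retry predicate: True only for a QueryResponseError caused by a low
    alignment score."""
    return isinstance(response, QueryResponseError) and (
        "low alignment score" in response.error_message
    )


@backoff.on_predicate(
    backoff.constant,
    predicate=_failed_align_score,
    max_tries=ALIGN_SCORE_N_RETRIES + 1,  # initial attempt + N retries
    interval=0,  # retry immediately
)
def get_llm_response(query: str):
    response = generate_llm_response(query)
    if _failed_align_score(response):
        # Surface the previous failure reason for debugging.
        response.debug_info["past_failure"] = response.error_message
    return response
```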
Future Tasks (optional)
How has this been tested?
Testing this is tricky because, for this change to be observable, the LLM response must succeed while AlignScore fails, and such cases are not straightforward to find.
It was tested in two ways.

First way:

1. Set `ALIGN_SCORE_THRESHOLD` to an unrealistic score (e.g. 1.5) so that AlignScore always fails.
2. Run `/search` with `generate_llm_response` set to `true`, with a content and a question relevant to that content. An example is content: "Here we are going to talk about pineapples because of their pine shapes and the apple-like taste" and question: "Are apples related to pineapples?" (see the example request after these steps).
3. Make sure the LLM response is run twice (by checking the logs) if `ALIGN_SCORE_N_RETRIES` is set to the default value (1).
4. Make sure `debug_info["past_failure"]` is in the returned response.
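For step 2, a hypothetical request shape (the host, payload fields, and response keys are assumptions, not the verified API schema):

```python
import requests

# Hypothetical request to the /search endpoint; adjust the host, auth, and
# field names to the actual API schema.
response = requests.post(
    "http://localhost:8000/search",
    json={
        "query_text": "Are apples related to pineapples?",
        "generate_llm_response": True,
    },
    timeout=30,
)
# Inspect the previous failure reason surfaced by the retry logic.
print(response.json().get("debug_info", {}).get("past_failure"))
```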
Second way:

Still set `ALIGN_SCORE_THRESHOLD` to a value > 1, but add temporary logic that reduces `ALIGN_SCORE_THRESHOLD` to a reasonable value every time the LLM response is regenerated, so that the AlignScore check passes after the second retry. In the second run, `ALIGN_SCORE_THRESHOLD` should therefore be less than 0.8. This approach is not straightforward; I am open to more efficient ways of testing this feature. A sketch of the idea follows.
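A minimal sketch of that test-only tweak, with stand-ins for the real LLM call and AlignScore check:

```python
ALIGN_SCORE_N_RETRIES = 1  # illustrative value


def generate_llm_response(query: str) -> str:
    """Stand-in for the real LLM call."""
    return f"Answer about: {query}"


def align_score(response: str) -> float:
    """Stand-in for the real AlignScore check; fixed score for illustration."""
    return 0.9


def search_with_decaying_threshold(query: str) -> str:
    threshold = 1.5  # unrealistic: guarantees the first check fails
    for attempt in range(ALIGN_SCORE_N_RETRIES + 1):
        response = generate_llm_response(query)
        if align_score(response) >= threshold:
            break  # AlignScore check passed
        # Test-only tweak: relax the threshold before the next regeneration
        # so the retry can pass (below 0.8 on the second run).
        threshold = 0.75
    return response
```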
Checklist
Fill with `x` for completed.

- [x] My code follows the style guidelines of this project
- [x] I have reviewed my own code to ensure good quality
- [x] I have tested the functionality of my code to ensure it works as intended
- [x] I have resolved merge conflicts

(Delete any items below that are not relevant)

- [ ] I have updated the automated tests
- [ ] I have updated the scripts in `scripts/`
- [ ] I have updated the requirements
- [ ] I have updated the README file
- [ ] I have updated affected documentation
- [ ] I have added a blogpost in Latest Updates
- [ ] I have updated the CI/CD scripts in `.github/workflows/`
Reviewer: @amiraliemami. Estimate: 40 mins.