Current State

The QAEval chain currently uses the language model (LLM) provided in the harness to evaluate the responses generated by that same LLM.
Additionally, the current implementation does not save the samples produced when executing `harness.run()`. Without saved samples, the generated outputs cannot be reviewed or analyzed, and the evaluation cannot be revisited, without rerunning the model.
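For reference, a minimal sketch of the current workflow, assuming a typical langtest-style `Harness` setup; the model and data-source names below are only illustrative:

```python
from langtest import Harness

# The model configured here both generates the answers and is handed to the
# QAEval chain to grade them, so the LLM effectively judges its own output.
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data={"data_source": "BoolQ"},
)

harness.generate()  # create the test cases
harness.run()       # generate answers and evaluate them with QAEval
harness.report()    # aggregate report; the per-sample results are not persisted
```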
Suggested Solution
Model Selection Option: Provide users with an option to choose the model used for evaluation. By default, load a GPT model as the evaluator, since models that are not well instruction-tuned can grade answers poorly and produce suboptimal results. A possible API shape is sketched after this list.
Save `harness.run()` Results: Implement functionality to save the results obtained from `harness.run()`. This lets users import the saved results and make changes to the evaluation without rerunning the model, making the evaluation process more efficient and flexible. A possible save-and-reload flow is sketched in the second example below.
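For the first suggestion, one possible shape is a separate evaluation entry in the harness configuration. The `evaluation` argument below is hypothetical and only illustrates the idea of decoupling the judge model from the model under test:

```python
from langtest import Harness

harness = Harness(
    task="question-answering",
    # Model under test (may be weakly instruction-tuned).
    model={"model": "my-local-llm", "hub": "huggingface"},
    data={"data_source": "BoolQ"},
    # Hypothetical: a dedicated judge model for the QAEval chain.
    # If omitted, a GPT model could be loaded by default, since poorly
    # instruction-tuned models tend to grade answers unreliably.
    evaluation={"model": "gpt-4", "hub": "openai"},
)

harness.generate()
harness.run()
harness.report()
```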
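For the second suggestion, a sketch of a save-and-reload flow; `save_results`, `load_results`, and `re_evaluate` are hypothetical method names used only to illustrate the requested behaviour:

```python
# Hypothetical: persist the generated samples together with their evaluations.
harness.run()
harness.save_results("qa_run_results.json")

# Later, reload the saved samples and re-run only the evaluation step with a
# different judge model, without regenerating answers from the model under test.
saved = Harness.load_results("qa_run_results.json")
saved.re_evaluate(evaluation={"model": "gpt-4", "hub": "openai"})  # hypothetical
saved.report()
```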