getappmap / navie-benchmark

Navie benchmarks

Create a concise report of 0-scoring instances #84

Open · kgilpin opened this issue 3 weeks ago

kgilpin commented 3 weeks ago

Instances that score 0 are a great place to look when figuring out what is going wrong with a solver, or when looking for ways to make a solver better.

For example, 0-scoring instances often reveal cases in which the LLM output is not being handled properly, or in which the options passed to Navie are not behaving as expected.

The solver should print 0-scoring solve logs to the console, and also collect them into a dedicated file. This 0-scoring log file should be uploaded as an Action output as well.

github-actions[bot] commented 3 weeks ago

Title: Enhance Logging for 0-Scoring Instances

Problem: Instances that score 0 indicate potential issues with the solver, such as improper handling of LLM output or unexpected behavior in the options passed to Navie. Identifying these instances can help diagnose and improve solver performance. The goal is to report these instances by logging them to the console and collecting them into a file dedicated to 0-scoring instances.

Analysis: To address this task, we need to implement a mechanism to track and report instances that result in a 0 score. The process should consist of two major enhancements:

  1. Modifying the current logging functionality to include additional logging for 0-scoring instances.
  2. Implementing a new feature to collect the details of 0-scoring instances into a specific log file. This log file will then be marked as an output artifact in the build process, ensuring its availability for further analysis.

Proposed Changes:

  1. solver/report.py:

    • Modify the logic in the section that collects evaluation reports to include a conditional check for instances with a score of 0.
    • Print instances with a 0 score to the console for immediate visibility during execution.
    • Aggregate these 0-scoring instances and, within the Report class, introduce a function dedicated to writing their details into a separate dedicated log file, 0-scores.log (a sketch of this change follows the list).
  2. run_evaluation.py:

    • After the evaluation script finishes and reports are gathered, invoke the logic that collects 0-scoring instances and writes them to the specified file (see the call-site example below).
    • Ensure that this file is uploaded as an Action output so that it is retained beyond the execution session for further analysis.
  3. Modularize the logging setup:

    • Centralize the logging configuration to ensure consistency across the various logging demands, especially when printing the 0-scoring results (a minimal configuration sketch appears after this list).
    • Keep the setup flexible enough to accommodate future enhancements in logging detail level and output destinations.
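To make the report-side change concrete, here is a minimal sketch. The names EvaluationRecord, collect_zero_scoring, and the record fields (instance_id, score, log) are hypothetical placeholders, not the existing solver/report.py API; they only indicate the shape of the change.

```python
# Hypothetical sketch: collect 0-scoring instances, print them to the console,
# and append their solve logs to a dedicated 0-scores.log file.
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass
class EvaluationRecord:
    instance_id: str
    score: float
    log: str  # solve log captured for this instance


def collect_zero_scoring(
    records: Iterable[EvaluationRecord],
    output_file: Path = Path("0-scores.log"),
) -> List[EvaluationRecord]:
    """Return the 0-scoring records, echoing each one to the console and to output_file."""
    zero_scoring = [r for r in records if r.score == 0]
    with output_file.open("a") as f:
        for record in zero_scoring:
            print(f"[0-score] {record.instance_id}")  # immediate console visibility
            f.write(f"=== {record.instance_id} ===\n{record.log}\n\n")
    return zero_scoring
```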
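The call site in run_evaluation.py would then be a few lines at the end of the run. Here `reports` stands in for whatever collection of evaluation results the script already builds, and the workflow's artifact-upload step (for example actions/upload-artifact) would pick up 0-scores.log from the working directory.

```python
# Hypothetical call site at the end of run_evaluation.py, after reports are gathered.
zero_scoring = collect_zero_scoring(reports)
print(f"{len(zero_scoring)} instance(s) scored 0; details written to 0-scores.log")
```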
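For the logging modularization, one option is a shared helper that every module imports. The module name and handler layout below are assumptions for illustration, built only on the standard library logging module.

```python
# Hypothetical shared logging setup, e.g. a solver/log_config.py module.
import logging
from pathlib import Path


def configure_logging(
    level: int = logging.INFO,
    zero_score_file: Path = Path("0-scores.log"),
) -> logging.Logger:
    """Attach a console handler to the solver logger and a file handler
    to a child logger reserved for 0-scoring results."""
    logger = logging.getLogger("solver")
    logger.setLevel(level)
    logger.addHandler(logging.StreamHandler())

    zero_score_logger = logging.getLogger("solver.zero_scores")
    zero_score_logger.addHandler(logging.FileHandler(zero_score_file))
    return logger
```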

By making these changes, the user gains clear visibility into instances where the solver underperforms. Collecting this data will provide actionable insight for improving the solver's robustness over time.