Report tests that fail or error during initial enumeration

kgilpin commented 3 weeks ago

When the solver encounters a test case that fails or errors during initial test selection, report the test path and the error.

This can be used to identify problems with the test harness that result in missed / failed solutions.

An example:

https://github.com/getappmap/navie-benchmark/actions/runs/11544982873/job/32131260977#step:7:449

error: patch with only garbage at line 3
Creating test database for alias 'default' ('file:memorydb_default?mode=memory&cache=shared')\u2026
Testing against Django installed in '/testbed/django'
Importing application aggregation_regress
Skipping setup of unused database(s): other.
Operations to perform:
  Synchronize unmigrated apps: aggregation_regress, auth, contenttypes, messages, sessions, staticfiles
  Apply all migrations: admin, sites
Synchronizing apps without migrations:
Traceback (most recent call last):
  File "./tests/runtests.py", line 503, in <module>
    options.exclude_tags,
  File "./tests/runtests.py", line 290, in django_tests
    extra_tests=extra_tests,
  File "/testbed/django/test/runner.py", line 629, in run_tests
    old_config = self.setup_databases(aliases=databases)
  File "/testbed/django/test/runner.py", line 554, in setup_databases
    self.parallel, **kwargs
  File "/testbed/django/test/utils.py", line 174, in setup_databases
    serialize=connection.settings_dict.get('TEST', {}).get('SERIALIZE', True),
  File "/testbed/django/db/backends/base/creation.py", line 72, in create_test_db
    run_syncdb=True,
  File "/testbed/django/core/management/__init__.py", line 148, in call_command
    return command.execute(*args, **defaults)
  File "/testbed/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/testbed/django/core/management/base.py", line 83, in wrapped
    res = handle_func(*args, **kwargs)
  File "/testbed/django/core/management/commands/migrate.py", line 203, in handle
    self.sync_apps(connection, executor.loader.unmigrated_apps)
  File "/testbed/django/core/management/commands/migrate.py", line 324, in sync_apps
    self.stdout.write("  Creating tables\u2026\n")
  File "/testbed/django/core/management/base.py", line 145, in write
    self._out.write(style_func(msg))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 17: ordinal not in range(128)

github-actions[bot] commented 3 weeks ago

Title: Report Test Path and Error for Failing or Erroneous Tests in Initial Enumeration

Problem: During the initial test selection phase, some test cases fail or raise errors, as in the error log provided. The solver should report the path of each such test together with its error, so that problems in the test harness can be diagnosed. Left unreported, these failures can result in missed or failed solutions.

Analysis: In the example provided, a UnicodeEncodeError occurs during database setup: the 'ascii' codec cannot encode the character '\u2026' (a horizontal ellipsis) that the migrate command writes in its "Creating tables" message. Errors like this during initial test enumeration can halt the process, leading to missed solutions or incorrect assumptions about test success. To diagnose them, the system should report the test path that triggered the error along with the error message. That information points debugging at the test harness or the test environment configuration rather than at the solution under test.

Proposed Changes:

solver/solve_instance.py:
- Enhance error handling to capture and log the test path and error message when a test case fails or errors during initial test selection. This should be done by modifying the report_error function to include additional information about the test path along with the error details.
solver/workflow/summarize_test_errors.py:
- Integrate a mechanism to capture the specific test path related to the encountered error when summarizing test errors. Update the summarize_test_errors function to extract and return the test path alongside the error messages.
swebench/harness/grading.py:
- Update the evaluation-related functions to ensure they can accommodate additional reporting of test paths and errors. Specifically, modify functions such as get_eval_report or utility functions to return more detailed error information when failures occur.
swebench/harness/run_evaluation.py:
- Adjust the error handling process within the test execution to ensure that when an error such as UnicodeEncodeError occurs, the system reports the full details, including the test path and error, to the calling functions.
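
The reporting described above could look roughly like the following minimal sketch. It is not the solver's actual code: `EnumerationFailure`, `last_error_line`, `enumerate_tests`, and `format_report` are hypothetical names, and it assumes each test can be launched as a separate command whose non-zero exit status signals a failure or error.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EnumerationFailure:
    """Hypothetical record of one test that failed or errored during enumeration."""
    test_path: str
    returncode: int
    error: str  # last non-empty line of stderr, usually the exception message


def last_error_line(stderr: str) -> str:
    """Return the last non-empty stderr line, e.g. 'UnicodeEncodeError: ...'."""
    lines = [line for line in stderr.splitlines() if line.strip()]
    return lines[-1] if lines else "(no stderr output)"


def enumerate_tests(test_paths: Sequence[str],
                    command_prefix: Sequence[str]) -> List[EnumerationFailure]:
    """Run each test path once and collect the ones that exit non-zero."""
    failures = []
    for path in test_paths:
        result = subprocess.run(
            [*command_prefix, path],
            capture_output=True,
            text=True,
            errors="replace",  # never fail while decoding the test's own output
        )
        if result.returncode != 0:
            failures.append(
                EnumerationFailure(path, result.returncode,
                                   last_error_line(result.stderr))
            )
    return failures


def format_report(failures: Sequence[EnumerationFailure]) -> str:
    """One line per failing test: path, exit status, and error."""
    return "\n".join(
        f"{f.test_path}: exit {f.returncode}: {f.error}" for f in failures
    )
```

Calling `enumerate_tests(paths, [sys.executable])` would run each path as a plain Python script; for the Django example the prefix would instead be whatever command the harness uses to invoke `./tests/runtests.py`.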

kgilpin commented 3 weeks ago

@dividedmind this test fails without any modification by us; any thoughts on why it's blowing up on this encoding issue?

kgilpin commented 3 weeks ago

Tests can fail because of things like specific environment setup that's needed; we do need to filter these out, as we do, but I don't think we should spend time tracking down each one.

dividedmind commented 3 weeks ago

> why it's blowing up on this encoding issue?

My guess is that it's due to the container environment being incomplete and not setting the locale correctly. Setting PYTHONIOENCODING=utf-8 in the environment might help.
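
A small sketch of the effect, assuming nothing about the actual container: it runs a child interpreter that writes the same '\u2026' character, once with PYTHONIOENCODING=ascii to imitate an ASCII-only locale and once with PYTHONIOENCODING=utf-8. The `CHILD` program and the `run_child` helper are made up for illustration.

```python
import os
import subprocess
import sys

# Child program that writes the same character as the "Creating tables" message.
CHILD = r"import sys; sys.stdout.write('Creating tables\u2026\n')"


def run_child(io_encoding):
    """Run CHILD in a fresh interpreter with PYTHONIOENCODING set as given."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", CHILD], env=env, capture_output=True
    )


ascii_run = run_child("ascii")   # stands in for a container with no usable locale
utf8_run = run_child("utf-8")    # the suggested setting

print("ascii exit status:", ascii_run.returncode)
print("utf-8 exit status:", utf8_run.returncode)
```

With the ASCII setting the child exits with an error and its stderr contains a UnicodeEncodeError; with utf-8 it exits cleanly. Whether the same setting fixes the benchmark container is still a guess until it is tried there.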

getappmap / navie-benchmark

Report tests that fail or error during initial enumeration #80