UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/
MIT License

Validation error when reading Eval log #834

Open us2547 opened 2 days ago

us2547 commented 2 days ago

When using read_eval_log to read an EvalLog (JSON), I receive the error below. The problematic log is not consistent; it happens only rarely. Also, when using the EvalLog object returned directly from the run (rather than reading it from a file), iterating over the object works fine.

One observation: the problem seems related to a scorer that outputs a "string" value and has no "metric". This appears related to issue #775.

@scorer(metrics=[])
def problem_type(model: Model):
......

Unfortunately, the problematic file is very large. Below is an example error trace; index "2" refers to the scorer that outputs a "string".

File "/usr/local/stage3technical/var/virtualenv/tcom-middle-tier-10-26-24/lib/python3.11/site-packages/inspect_ai/log/_file.py", line 201, in _read_header_streaming
    results = EvalResults(**v)
              ^^^^^^^^^^^^^^^^
  File "/usr/local/stage3technical/var/virtualenv/tcom-middle-tier-10-26-24/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 8 validation errors for EvalResults
scores.2.metrics.accuracy.value.int
  Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/int_type
scores.2.metrics.accuracy.value.float
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/float_type
sample_reductions.2.samples.165.value.str
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
sample_reductions.2.samples.165.value.int
  Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/int_type
sample_reductions.2.samples.165.value.float
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/float_type
sample_reductions.2.samples.165.value.bool
  Input should be a valid boolean [type=bool_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/bool_type
sample_reductions.2.samples.165.value.list[union[str,int,float,bool]]
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/list_type
sample_reductions.2.samples.165.value.dict[str,nullable[union[str,int,float,bool]]]
  Input should be a valid dictionary [type=dict_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/dict_type
jjallaire-aisi commented 2 days ago

@dragonstyle Could you take a look at this?

us2547 commented 2 days ago

It seems the problem is not linked only to the scorer that outputs a string: after commenting that scorer out, the problem was reproduced with a different scorer failing validation. A run on the same dataset but with a limited number of samples works, so I suspect the problem is somehow related to a specific sample. Is there a way to debug the EvalLog to identify the root cause? The problem was reproduced with the latest inspect version.

us2547 commented 1 day ago

I believe the root cause is a scorer outputting "value" as null. After changing the null values to zero, the log parser works. For example, from "metrics":

"metrics": {
  "accuracy": {
    "name": "accuracy",
    "value": null,
    "options": {}
  }
}

or from "samples":

{
  "value": null,
  "answer": "Answer .....",
  "explanation": "Explanation .....",
  "metadata": {
    "faithfulness_score": null
  },
  "sample_id": "id-2024-5"
},
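On the earlier question of how to debug the EvalLog: one low-tech option is to bypass read_eval_log and scan the raw JSON for null "value" fields directly. A minimal stdlib sketch (the fragment and field names below are assumed from the excerpts above, not inspect-ai internals):

```python
import json

def find_null_values(node, path="$"):
    """Walk parsed JSON and collect the paths of every null 'value' field."""
    hits = []
    if isinstance(node, dict):
        for key, child in node.items():
            child_path = f"{path}.{key}"
            if key == "value" and child is None:
                hits.append(child_path)
            hits.extend(find_null_values(child, child_path))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            hits.extend(find_null_values(child, f"{path}[{i}]"))
    return hits

# tiny fragment shaped like the excerpts above (structure assumed)
fragment = json.loads(
    '{"scores": [{"metrics": {"accuracy": {"name": "accuracy", "value": null}}}]}'
)
print(find_null_values(fragment))  # ['$.scores[0].metrics.accuracy.value']
```

Running this over the full log file would print a path for each null score, which can be mapped back onto the indices in the validation error (e.g. sample_reductions.2.samples.165).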
dragonstyle commented 1 day ago

Is the issue that the value in some cases is being returned as NaN? I could see that we would serialize that to null, and that our type validation wouldn't allow it to pass, since null isn't a valid value for a score.

us2547 commented 11 hours ago

The value returned was numpy.nan, which inspect-ai serialized as null in the JSON log. The Score class allows setting value to numpy.nan, and there are no warnings or errors when doing so; the error only appears when reading the log back with the inspect utility.
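The round trip described here can be reproduced with plain pydantic, outside inspect-ai. A minimal sketch (the Metric model below is illustrative, not inspect-ai's actual class): by default pydantic v2 serializes a NaN float as JSON null (the ser_json_inf_nan setting), and null then fails float validation when the JSON is read back.

```python
import json
from pydantic import BaseModel, ValidationError

class Metric(BaseModel):
    name: str
    value: float

m = Metric(name="accuracy", value=float("nan"))  # NaN accepted silently
dumped = m.model_dump_json()
print(json.loads(dumped)["value"])  # NaN was written out as null

try:
    Metric.model_validate_json(dumped)  # re-reading the log fails
except ValidationError as e:
    print("validation error:", e.errors()[0]["type"])
```

A practical guard on the scorer side is to map NaN to a sentinel (e.g. 0, as in the workaround above) or raise before constructing the Score, since the failure otherwise surfaces only when the log is read back.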