Closed: R4ZZ3 closed this 2 months ago
Hi @R4ZZ3!
Thanks for reporting. Quite annoying indeed... sorry for that. We'll ship a fix soon! Note that the lib is still under development, so there are many things we haven't implemented yet!
Could I ask which model config you were using for running the evals? From our testing, the quantized model with the vLLM engine provides really good speeds (2 hours for 6,600 samples sounds too long!)
@R4ZZ3 Raising parsing errors is now optional in batched evaluations and defaults to false, so execution doesn't crash.
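For reference, a minimal sketch of what this opt-in behavior could look like; the `raise_on_error` name, the `EvalOutput` fields, and the tag-based response format are illustrative assumptions, not the exact flow-judge implementation:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class EvalOutput:
    feedback: str
    score: int | None


def parse_eval_output(raw: str, raise_on_error: bool = False) -> EvalOutput:
    """Parse a raw judge response, optionally tolerating malformed output.

    With raise_on_error=False (the default), a response that cannot be
    parsed yields a placeholder result instead of crashing the whole batch.
    """
    try:
        # Hypothetical response format: feedback and score wrapped in tags.
        feedback = raw.split("<feedback>")[1].split("</feedback>")[0].strip()
        score = int(raw.split("<score>")[1].split("</score>")[0].strip())
        return EvalOutput(feedback=feedback, score=score)
    except (IndexError, ValueError) as exc:
        if raise_on_error:
            raise
        logger.warning("Failed to parse eval output, returning placeholder: %s", exc)
        return EvalOutput(feedback="Error", score=None)
```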
Thanks for reporting!
Over the weekend I tried this but had to change some code.
In more detail: when running on bigger datasets there can be parsing errors, and currently a single failed parse fails the whole batch.
More specifically here: https://github.com/flowaicom/flow-judge/blob/main/flow_judge/eval_data_types.py#L37
I changed this to return feedback = 'Error' so these samples can be filtered out later during score-based filtering. Maybe a warning would suffice, or the user could be given the option to error/warn in these cases, plus an option for a default value when parsing fails, so that one bad sample does not fail the whole eval run (I ran it on 6600 samples in my testing for around 2 hours and it failed at the end).
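Here is a rough sketch of the downstream filtering this workaround enables; the field names and the 'Error' sentinel follow the change described above, but are otherwise assumptions:

```python
# Example eval outputs after a batched run; the placeholder entry comes
# from a sample whose judge response could not be parsed.
outputs = [
    {"feedback": "Well grounded answer.", "score": 4},
    {"feedback": "Error", "score": None},  # parsing failed for this sample
    {"feedback": "Partially correct.", "score": 3},
]

# Drop placeholder results before any score-based filtering or aggregation.
valid = [o for o in outputs if o["feedback"] != "Error"]
mean_score = sum(o["score"] for o in valid) / len(valid)
print(f"Kept {len(valid)}/{len(outputs)} samples; mean score: {mean_score:.2f}")
```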