flowaicom / flow-judge

Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafted for accuracy, speed, and customization.

Add an option for handling parsing errors #2

Closed by R4ZZ3 2 months ago

R4ZZ3 commented 2 months ago

Over the weekend I tried this out, but I had to change some code.

In more detail: when running this on bigger datasets there can be parsing errors, and currently a single one fails the whole batch.

More specifically here: https://github.com/flowaicom/flow-judge/blob/main/flow_judge/eval_data_types.py#L37

I changed this to return feedback = 'Error' so these rows can be filtered out later during score-based filtering. Maybe a warning would suffice, or the user could be given the option to choose between error and warning in these cases, plus maybe an option for a default value when parsing fails, so that the whole eval run doesn't die. (In my testing I ran it on 6600 samples for around 2 hours and it failed at the end.)
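For reference, the workaround amounts to catching the parse failure and returning a sentinel instead of raising. A minimal sketch, not the library's actual parser; the `<feedback>`/`<score>` tag format, the `EvalOutput` fields, and the `-1` sentinel score are assumptions:

```python
import re

from pydantic import BaseModel


class EvalOutput(BaseModel):
    feedback: str
    score: int


def parse_eval_output(response: str, strict: bool = False) -> EvalOutput:
    """Parse a judge response; on failure either raise or return a sentinel.

    Sketch only: the <feedback>/<score> tags and the -1 sentinel score
    are assumptions about the model's output format.
    """
    try:
        feedback = re.search(
            r"<feedback>(.*?)</feedback>", response, re.DOTALL
        ).group(1).strip()
        score = int(re.search(r"<score>\s*(\d+)\s*</score>", response).group(1))
        return EvalOutput(feedback=feedback, score=score)
    except (AttributeError, ValueError) as e:
        if strict:
            raise ValueError(f"Failed to parse evaluation response: {e}") from e
        # Sentinel row: keeps the batch alive, filtered out downstream.
        return EvalOutput(feedback="Error", score=-1)
```

Downstream, rows with `feedback == "Error"` can be dropped before score-based filtering.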

bergr7 commented 2 months ago

Hi @R4ZZ3!

Thanks for reporting. Quite annoying indeed... sorry about that. We'll ship a fix soon! Note that the lib is still under development, so there are many things we haven't implemented yet.

Could I ask which model config you were using to run the evals? From our testing, the quantized model with the vLLM engine provides really good speeds (2 hr for 6600 samples sounds too long!).
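For context, batched evaluation with the quantized model on vLLM looks roughly like this. Treat it as a sketch: the class and argument names (`Vllm`, `FlowJudge`, `EvalInput`, the metric constant) follow the repo's README but may differ between versions:

```python
from flow_judge import EvalInput, FlowJudge, Vllm
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Assumption: the Vllm engine defaults to the AWQ-quantized checkpoint.
model = Vllm()
judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)

eval_inputs = [
    EvalInput(
        inputs=[{"query": q}, {"context": c}],
        output={"response": r},
    )
    for q, c, r in dataset  # dataset: an iterable of (query, context, response)
]
results = judge.batch_evaluate(eval_inputs)
```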

bergr7 commented 2 months ago

@R4ZZ3 Raising parsing errors is now optional in batched evaluations and defaults to False, so execution doesn't crash on a single malformed response.

Thanks for reporting!
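A sketch of the new toggle; the keyword name `fail_on_parse_error` is an assumption about the merged change, so check the actual signature in the repo:

```python
# Default: malformed responses produce sentinel outputs instead of
# crashing the batch. Pass True (assumed flag name) to opt back in
# to strict, raise-on-error behavior.
results = judge.batch_evaluate(eval_inputs, fail_on_parse_error=True)
```

Rows that failed to parse can then be filtered out by their sentinel values before aggregating scores.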