Hi,
This issue also exists in our experiments. The best solution we've come up with is to manually revise these instances. Directly counting them as zero is of course simpler, but identifying these exceptions is a challenge, since the exceptions that arise when the LLM parses a plan vary greatly. Given the rigorous nature of this research work, we cannot afford to assign a score of zero every time an exception occurs, especially considering the variety of other exceptions that might arise.
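If it helps, here is a minimal sketch of that triage step, assuming the raw model outputs are available as plain strings; `triage_outputs` and the `"error"`-key check are assumptions for illustration, not part of the TravelPlanner codebase:

```python
import json

def triage_outputs(raw_outputs):
    """Split raw LLM outputs into parsed plans and instances to revise by hand."""
    parsed, needs_review = [], []
    for idx, raw in enumerate(raw_outputs):
        try:
            plan = json.loads(raw)
        except json.JSONDecodeError:
            # Unparsable output: flag for manual revision instead of auto-zeroing.
            needs_review.append((idx, raw))
            continue
        # Outputs like {"error": "..."} parse fine but are not plans.
        if isinstance(plan, dict) and "error" in plan:
            needs_review.append((idx, raw))
        else:
            parsed.append((idx, plan))
    return parsed, needs_review
```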
@hsaest very helpful, makes sense. Thanks.
hi team.
As you can see in the code, the temperature is set to zero, but the LLM's responses are still not fully deterministic. Therefore, sometimes, though not often, the LLM does not generate a valid plan.
For example, in example #48 of the validation set, with the query:
Most of the time it generates a valid plan, like the one below, which can be parsed into JSON format for further evaluation.
However, a few times, under exactly the same prompt, it generates
and when parsed by the LLM, the resulting JSON is:
{"error": "Insufficient budget provided"}
So in this case it causes an error in eval.py.
Is there any suggestion for avoiding this error? Maybe for this type of case, the eval system could directly count it as 0 delivery? Thanks.
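For concreteness, here is a minimal sketch of the fallback I have in mind, where any output that fails to parse, or parses to an error object like the one above, scores 0 delivery; `evaluate_plan` is a hypothetical stand-in for the real scoring logic in eval.py, not its actual API:

```python
import json

def safe_delivery_score(raw_output, evaluate_plan):
    """Return the delivery score, or 0 when the output is not a valid plan."""
    try:
        plan = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0  # malformed output: count as non-delivered
    if isinstance(plan, dict) and "error" in plan:
        return 0  # e.g. {"error": "Insufficient budget provided"}
    return evaluate_plan(plan)
```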