Closed by ryanhoangt 1 week ago
@ryanhoangt Can you please post a traceback from the logs if you have one, by any chance, or the .jsonl? I made a quick fix in the linked PR; I'd like to look at it some more though.
Unfortunately, there's no traceback in the trajectory in the jsonl file. There's only one last entry from the history field, besides the error field above. I can try capturing the traceback (if there is one) from the logs directly next time.
```json
{
  "id": 84,
  "timestamp": "2024-10-02T10:06:45.050451",
  "source": "agent",
  "message": "There was an unexpected error while running the agent",
  "observation": "error",
  "content": "There was an unexpected error while running the agent",
  "extras": {}
}
```
I'm also quite confused about whether it is `litellm.APIError` or `OpenAIException`. From the docs it seems to me like `OpenAIException` is a provider-specific exception and `litellm.APIError` is a wrapper for all providers.
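If it helps, here's a minimal sketch of how I read the litellm docs (illustrative only, not OpenHands code; the model name and handling are arbitrary): catching `litellm.APIError` should cover the provider-specific `OpenAIException`-style errors, including a 502 from the proxy.

```python
# Sketch only: catch litellm's wrapper exception instead of the
# provider-specific OpenAI one. Assumes litellm maps provider errors
# (including 502s from a proxy) into litellm.APIError.
import litellm


def call_model(messages):
    try:
        return litellm.completion(model="openai/gpt-4o", messages=messages)
    except litellm.APIError as exc:
        # litellm's APIError generally carries the HTTP status of the failure,
        # so a 502 from the proxy should show up here.
        status = getattr(exc, "status_code", None)
        print(f"LLM call failed (status={status}): {exc}")
        raise
```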
The linked PR added retries in our LLM class, but I think a better fix would retry the eval, or make sure the failed instance is not written to the jsonl so that it will be attempted again.
Thanks for the fix! Btw, can you explain why retrying the whole eval is better? I'm not sure about the architectural side, but imo it may not be necessary to run again from the first step (especially when we're at the very end of the trajectory).
Oh, they're not exclusive. The request is retried now, and we can configure the retry settings to make more attempts (in `config.toml`, for the respective `llm.eval` group). You may want to do that and give it as much time as you see fit; that will retry from the current state.
But there will still be a limit, so my thinking here is simply that if the proxy continues to be unavailable at that point, the reasonable thing is to give up on the instance and just not save it in the jsonl, so we can rerun it. 🤔
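To make the idea concrete, here's a rough sketch (not the actual OpenHands code; `process_instance`, the model name, and the retry numbers are placeholders): retry the request a few times with backoff, and if it still fails, skip the write to `output.jsonl` so the instance gets re-attempted on the next run.

```python
# Rough sketch of "retry, then give up without writing to output.jsonl".
# Uses tenacity for exponential backoff; attempt count and wait times are
# placeholders, not OpenHands defaults.
import json

import litellm
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(litellm.APIError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    reraise=True,
)
def completion_with_retry(**kwargs):
    return litellm.completion(**kwargs)


def process_instance(instance, messages, output_path):
    """Hypothetical eval worker: only append to the jsonl on success."""
    try:
        result = completion_with_retry(model="openai/gpt-4o", messages=messages)
    except litellm.APIError:
        # Proxy still unavailable after all retries: do NOT write the
        # instance, so the next eval run picks it up again.
        return None
    record = {
        "instance_id": instance["instance_id"],
        "response": result.choices[0].message.content,
    }
    with open(output_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The important part is just the except branch: the failed instance never reaches the jsonl, so rerunning the eval with the same output file tries it again.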
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
I think I saw another error merged into the jsonl, but... only when it was 1 task and 1 worker. We usually use multiprocessing lately, which might be why we don't see it. Maybe.
On the other hand, we have meanwhile made more fixes and added some retries when inference ends abnormally, before it gets to the output file, so maybe it was fixed.
Yeah, from my side I can see the retries happening after your fix. Recently, with the new LLM proxy, I don't even receive 502 errors anymore. Maybe this PR can be closed.
Is there an existing issue for the same bug?
Describe the bug
When running the eval via All Hands AI's LLM proxy, the server sometimes crashes with a 502 response. The eval result is still collected into the `output.jsonl` file, with the `error` field being:

Then we have to manually filter out instances with that error and rerun them. Maybe we should have some kind of logic to automatically retry in this scenario.
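For context, the manual cleanup is roughly something like this hypothetical script (not part of OpenHands; the marker string is an assumption about what ends up in the `error` field):

```python
# Hypothetical cleanup script: drop instances whose "error" field records the
# proxy failure, so a subsequent eval run re-attempts them. The marker string
# below is an assumption and may need adjusting to the actual error text.
import json

ERROR_MARKER = "There was an unexpected error while running the agent"


def drop_failed_instances(path: str) -> int:
    kept, dropped = [], 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            error = record.get("error") or ""
            if ERROR_MARKER in error:
                dropped += 1
            else:
                kept.append(line)
    with open(path, "w") as f:
        f.writelines(kept)
    return dropped


if __name__ == "__main__":
    print(f"Removed {drop_failed_instances('output.jsonl')} failed instances")
```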
Current OpenHands version
Installation and Configuration
Model and Agent
Operating System
Linux
Reproduction Steps
No response
Logs, Errors, Screenshots, and Additional Context
No response