Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
https://striveworks.github.io/valor/

Handle bad llm response with retries (llm-guided metrics) #743

Closed bnativi closed 1 month ago

bnativi commented 2 months ago

With the initial text generation metric PR, if an LLM provides an invalid response for one of our LLM-guided metrics (wrongly formatted, wrong data type, etc.), then Valor raises an error and the rest of the evaluation is not completed. This seems like a poor user experience, although we should collect some user feedback on this.

Two improvements could be made:

bnativi commented 2 months ago

PR #728 adds retries that simply repeat the same LLM request as before. If a seed is set (say, for OpenAI's API), then this retry logic should return the same mis-formatted response. However, if a seed is not set, the LLM's responses can vary from call to call even with the same input, so making the exact same call again may yield a valid response.
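The retry-on-invalid-response idea could be sketched as follows. This is a minimal illustration, not Valor's actual implementation: the `client.complete` method, `parse_fn` callback, and exception names are all hypothetical stand-ins.

```python
import json


class InvalidLLMResponseError(Exception):
    """Raised when the LLM response cannot be parsed into the expected format."""


def call_llm_with_retries(client, messages, parse_fn, max_retries=3, seed=None):
    """Call a (hypothetical) LLM client, retrying when the response fails validation.

    Note: if `seed` is fixed and the API is deterministic, every retry will
    likely return the same malformed response; retries mainly help when
    sampling is non-deterministic.
    """
    last_error = None
    for _ in range(max_retries):
        raw = client.complete(messages, seed=seed)  # hypothetical client API
        try:
            # parse_fn validates format and data type, e.g. json.loads plus
            # schema checks, raising on any malformed response.
            return parse_fn(raw)
        except (ValueError, KeyError) as err:
            last_error = err
    raise InvalidLLMResponseError(
        f"LLM returned an invalid response after {max_retries} attempts"
    ) from last_error
```

With this shape, a single metric's bad response raises a dedicated exception after the retry budget is exhausted, which the caller can catch to skip that metric instead of aborting the whole evaluation.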