alexChiu127 opened 5 days ago
Thanks @alexChiu127. What API or service are you using for self-hosting the model?
Hi,
I'm not sure how accurate my answer is, but I believe our company is using HuggingFace-style APIs.
If you need more precise information, please let me know.
What version of arize-phoenix-evals are you using? You can find this by running pip show arize-phoenix-evals.
v0.15.1, thanks.
I tried upgrading it to the latest version, but the result is still "NOT_PARSABLE".
Hey @alexChiu127, there are a few reasons this might be happening.
If you set provide_explanation, I would expect to see the unparsed output in the explanation column for each eval. It seems like that's not what you're seeing, though; is that correct?
Yes, I did set provide_explanation, but all the results I see are None. Function calling is supported, so maybe I’ll try adjusting the max tokens?
You certainly can try upping max tokens to see if it helps. I think the fact that explanations are not showing up is a separate bug we'll need to bottom out.
Can you also try running llm_classify with the include_response parameter set to true? This will include the raw response for each row in the dataframe so we can get a better sense of what is preventing parsing from working. It should be relatively straightforward to adapt this notebook to use your Llama 3 model and add include_response=True.
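Something along these lines is what I have in mind; note that the base_url, api_key, and model name below are placeholders for your self-hosted endpoint rather than known-good values, and the parameter names reflect recent arize-phoenix-evals releases, so double-check them against your installed version:

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Point OpenAIModel at the self-hosted, OpenAI-compatible endpoint.
# base_url, api_key, and model are placeholders; substitute your own values.
model = OpenAIModel(
    model="llama-3.1-70b-instruct",
    base_url="http://your-llama-endpoint:8000/v1",
    api_key="unused",
    temperature=0.0,
    max_tokens=1024,  # bump this if you suspect completions are being truncated
)

# One toy row shaped like the quickstart's hallucination data
# (input / reference / output are the columns the template fills in).
df = pd.DataFrame(
    {
        "input": ["Where is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is located in Paris, France."],
        "output": ["The Eiffel Tower is in Paris."],
    }
)

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
hallucination_eval = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # unparsed output should land in the explanation column
    include_response=True,     # adds the raw model response to the output dataframe
)
print(hallucination_eval)
```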
I adjusted max tokens to 1024, 2048, and 4096, but it still results in the same outcome (NOT_PARSABLE, None).
Below is my attempt to use llm_classify with the include_response parameter set to true. However, from the results (I tried HALLUCINATION on the same sample data from the quickstart), it seems like everything is being parsed correctly?
Here is the screenshot:
Ok, I see the response format is different when I switch the model to gpt-4o; maybe that is what causes the issue?
Can you also set provide_explanation to true?
Yes, I can. When I set provide_explanation to true, the result is different for Llama 3: all the data in the explanation column is "None", and the data in the response column is different too.
Max tokens set to the default (256), Llama 3:
Max tokens set to 4096, Llama 3:
Max tokens set to the default (256), gpt-4o, works great:
Hi, it seems possible that the function calling format of our Llama 3 model is inconsistent with the format expected by phoenix.evals (or maybe there is a bug in phoenix.evals, I'm not sure), which is why the results cannot be parsed. When I do not use function calling, the correct results appear.
I would like to ask: what are the disadvantages of not using function calling? For example, could the non-fixed JSON output lead to parsing issues?
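For concreteness, turning function calling off amounts to something like the sketch below; I am assuming the use_function_calling_if_supported flag on llm_classify is the relevant switch here, and the endpoint and model name are placeholders rather than our real values:

```python
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Placeholder endpoint and model name for the self-hosted deployment.
model = OpenAIModel(
    model="llama-3.1-70b-instruct",
    base_url="http://your-llama-endpoint:8000/v1",
    api_key="unused",
)

# One toy row with the columns the hallucination template expects.
df = pd.DataFrame(
    {
        "input": ["Where is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is located in Paris, France."],
        "output": ["The Eiffel Tower is in Paris."],
    }
)

hallucination_eval = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
    include_response=True,
    # Have the model answer in plain text instead of via OpenAI-style tool calls.
    # The trade-off: the label is then parsed out of free-form text, so a reply
    # that drifts from the expected format is more likely to end up NOT_PARSABLE
    # than a structured function-call response would be.
    use_function_calling_if_supported=False,
)
```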
I tried the eval quick start, https://docs.arize.com/phoenix/evaluation/evals, but I used our company's self-hosted llama3.1-70b-instruct model. I ran the following code to get the evaluation results for 10 sample data points. You can see that many entries in the hallucination_eval and qa_eval columns are "NOT_PARSABLE". I would like to ask whether this is because llama3.1's results cannot be parsed, or whether I should not be using the OpenAIModel object. Here is my notebook code:
Result: