Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

[BUG] Unable to Parse Results Using llama3.1-70b-instruct Model in Evaluation #5430

Open alexChiu127 opened 5 days ago

alexChiu127 commented 5 days ago

I tried the evals quickstart (https://docs.arize.com/phoenix/evaluation/evals), but with our company's self-hosted llama3.1-70b-instruct model. I ran the code below to get evaluation results for 10 sample data points, and many entries in the hallucination_eval and qa_eval columns come back as "NOT_PARSABLE". Is this because llama3.1's output cannot be parsed, or should I not be using the OpenAIModel object?

Here is my notebook code:

import nest_asyncio

from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

nest_asyncio.apply()  # This is needed for concurrency in notebook environments

MODEL = "llama3.1-70b-instruct"
API_KEY = "SkJblk8gWx1CpVSvXUkbFzP3hs"
API_BASE = f"https://mycompany.inc/llm/v3/models/{MODEL}"

eval_model = OpenAIModel(
    api_key=API_KEY,
    base_url=API_BASE,
    model_kwargs={
        "extra_headers": {
            "x-user-id": "user123"
        }
    }
)

# Define your evaluators
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)

# We have to make some minor changes to our dataframe to use the column names expected by our evaluators
# for `hallucination_evaluator` the input df needs to have columns 'output', 'input', 'context'
# for `qa_evaluator` the input df needs to have columns 'output', 'input', 'reference'
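# Note: df is assumed to be the sample dataframe loaded earlier in the quickstart notebook
# (not shown here), with 'query', 'response', and 'reference' columns before the changes below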
df["context"] = df["reference"]
df.rename(columns={"query": "input", "response": "output"}, inplace=True)
assert all(column in df.columns for column in ["output", "input", "context", "reference"])

# Run the evaluators, each evaluator will return a dataframe with evaluation results
# We upload the evaluation results to Phoenix in the next step
hallucination_eval_df, qa_eval_df = run_evals(
    dataframe=df, evaluators=[hallucination_evaluator, qa_evaluator], provide_explanation=True
)

results_df = df.copy()
results_df["hallucination_eval"] = hallucination_eval_df["label"]
results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
results_df["qa_eval"] = qa_eval_df["label"]
results_df["qa_explanation"] = qa_eval_df["explanation"]
results_df.head()

Result: (screenshot of results_df with many NOT_PARSABLE labels in the hallucination_eval and qa_eval columns)

axiomofjoy commented 5 days ago

Thanks @alexChiu127. What API or service are you using for self-hosting the model?

alexChiu127 commented 4 days ago

Hi,

I'm not sure how accurate this is, but I believe our company is using HuggingFace-style APIs.

If you need more precise information, please let me know.

axiomofjoy commented 3 days ago

What version of arize-phoenix-evals are you using? You can find it by running pip show arize-phoenix-evals.

alexChiu127 commented 3 days ago

v0.15.1, thanks.

alexChiu127 commented 3 days ago

I tried upgrading to the latest version, but the result is still NOT_PARSABLE.

axiomofjoy commented 3 days ago

Hey @alexChiu127, there are a few reasons this might be happening.

If you set provide_explanation, I would expect to see the unparsed output in the explanation column for each eval. It seems like that's not what you're seeing, though, is that correct?

alexChiu127 commented 2 days ago

Yes, I did set provide_explanation, but all the explanations I see are None. Our model supports function calling, so maybe I'll try adjusting the max tokens?
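For reference, a minimal sketch of how the token limit can be raised on the eval model, reusing the API_KEY and API_BASE from the snippet above (this assumes OpenAIModel accepts a max_tokens argument, which appears to default to 256):

from phoenix.evals import OpenAIModel

# Same configuration as before, but with a larger completion budget;
# max_tokens is assumed to default to 256 in phoenix.evals' OpenAIModel.
eval_model = OpenAIModel(
    api_key=API_KEY,
    base_url=API_BASE,
    max_tokens=1024,  # 2048 or 4096 can also be tried
    model_kwargs={"extra_headers": {"x-user-id": "user123"}},
)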

axiomofjoy commented 2 days ago

You certainly can try upping max tokens to see if it helps. I think the fact that explanations are not showing up is a separate bug we'll need to bottom out.

Can you also try running llm_classify with the include_response parameter set to true? This will include the raw response in the output dataframe so we can get a better sense of what is preventing parsing from working. It should be relatively straightforward to adapt this notebook to use your Llama 3 model and add include_response=True.
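A rough sketch of that call, reusing the eval_model and df defined above (HALLUCINATION_PROMPT_TEMPLATE and HALLUCINATION_PROMPT_RAILS_MAP are assumed to be the built-in hallucination template and rails exported by phoenix.evals):

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    llm_classify,
)

# The rails are the allowed output labels, e.g. "factual" / "hallucinated".
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

hallucination_df = llm_classify(
    dataframe=df,               # df already has 'input', 'output', 'reference' columns
    model=eval_model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,
    include_response=True,      # keep the raw LLM response in the output dataframe
)
hallucination_df.head()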

alexChiu127 commented 2 days ago
  1. I adjusted max tokens to 1024, 2048, and 4096, but it still results in the same outcome (NOT_PARSABLE, None).

  2. Below is my attempt to use llm_classify with the include_response parameter set to true. However, from the results (I tried HALLUCINATION on the same quickstart sample data), it seems like everything is being parsed correctly? Here are the screenshots: (two screenshots of the output dataframe)

alexChiu127 commented 2 days ago

OK, I see the response format is different when I switch the model to gpt-4o. Maybe that's what causes the issue?

(screenshot of the gpt-4o response format)

axiomofjoy commented 2 days ago

Can you also set provide_explanation to true?

alexChiu127 commented 2 days ago

Yes, I can. When I set provide_explanation to true, the results are different for Llama 3: all data in the explanation column is None, and the data in the response column is different too.

Max tokens at the default (256), Llama 3: (screenshot)

Max tokens set to 4096, Llama 3: (screenshot)

Max tokens at the default (256), gpt-4o, works great: (screenshot)

alexChiu127 commented 2 days ago

Hi, it seems possible that the function calling format of our llama3 model is inconsistent with the format expected by phoenix.evals (or maybe there are bugs in phoenix.evals, I'm not sure), which is why the results cannot be parsed. When I do not use function calling, the correct results appear.

(two screenshots: without function calling, the labels and explanations parse correctly)
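For reference, this is roughly what I mean by not using function calling; a sketch assuming use_function_calling_if_available is the relevant flag on llm_classify:

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    llm_classify,
)

# Same call as in the earlier sketch, but with function calling turned off
# (use_function_calling_if_available is assumed to default to True); the model
# then answers as plain text and the label is parsed from the raw completion.
hallucination_df = llm_classify(
    dataframe=df,
    model=eval_model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
    include_response=True,
    use_function_calling_if_available=False,
)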

I would like to ask: what are the disadvantages of not using function calling? For example, could the lack of a fixed JSON output format lead to parsing issues?