Add Evaluation Support to Arcee Python SDK

rivinduw commented 1 month ago

This PR introduces support for evaluations in the Arcee Python SDK. It adds a start_evaluation function to arcee/api.py:

Allows users to initiate various types of evaluation jobs, including LLM-as-a-judge and lm-eval-harness benchmarks.

Usage Example for testing

import os

# SDK configuration via environment variables
os.environ['ARCEE_API_URL'] = 'https://arcee-dev.dev.arcee.ai/api'
os.environ['ARCEE_ORG'] = 'rivinduorg'
os.environ['ARCEE_API_KEY'] = ''  # fill in your Arcee API key

openai_api_key = ''  # fill in your OpenAI API key

import arcee

# LLM-as-a-judge evaluation: the judge compares the deployment model
# against the reference model on the given QA set.
evaluation_params = {
    'evaluations_name': 'evals_test_oct7',
    'eval_type': 'llm_as_a_judge',
    'qa_set_name': 'mmlu_20q',
    'judge_model': {
        'model_name': 'gpt-4o',
        'custom_prompt': 'Evaluate which response better adheres to factual accuracy, clarity, and relevance.',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'deployment_model': {
        'model_name': 'gpt-4o-mini',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
    'reference_model': {
        'model_name': 'gpt-3.5-turbo-0125',
        'base_url': 'https://api.openai.com/v1',
        'api_key': openai_api_key,
    },
}

# Start the evaluation job, then look up its status by ID.
result = arcee.start_evaluation(**evaluation_params)
eval_status = arcee.get_evaluation_status(result['evaluations_id'])
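
For testing it can be handy to poll get_evaluation_status until the job finishes. A minimal sketch, assuming the status call returns a dict with a 'status' key and that 'completed' and 'failed' are the terminal values; neither is confirmed by this PR, so both are parameters. wait_for_evaluation is a hypothetical helper, not part of the SDK, and it takes the status function as an argument so it can be tried without a live backend.

```python
import time
from typing import Any, Callable, Dict, Iterable


def wait_for_evaluation(
    get_status: Callable[[str], Dict[str, Any]],
    evaluation_id: str,
    terminal_states: Iterable[str] = ("completed", "failed"),  # assumed values
    status_key: str = "status",  # assumed field name
    poll_interval: float = 10.0,
    timeout: float = 3600.0,
) -> Dict[str, Any]:
    """Call get_status(evaluation_id) until it reports a terminal state.

    Raises TimeoutError if no terminal state is seen within `timeout` seconds.
    """
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(evaluation_id)
        if status.get(status_key) in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"evaluation {evaluation_id} did not finish within {timeout} seconds"
            )
        time.sleep(poll_interval)
```

With the example above this would be called as wait_for_evaluation(arcee.get_evaluation_status, result['evaluations_id']), adjusting status_key and terminal_states to whatever the API actually returns.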

Jacobsolawetz commented 4 weeks ago

Noticed evaluation with different params but same name resolves to same ID, should error

rivinduw commented 4 weeks ago

> Noticed evaluation with different params but same name resolves to same ID, should error

Yup, params currently get overwritten, so we don't end up with two evaluations with the same name but different IDs. Should we error here or in platform? I think start pretraining might have the same behavior.

rivinduw commented 4 weeks ago

I have a local branch of platform that raises an error when evaluations have duplicates, but I'm thinking we should be consistent across all the other services too.

Currently corpus uploader https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/corpus.py#L171 has the same logic of updating with the new params.

Pretraining https://github.com/arcee-ai/arcee-platform/blob/eaec257eca5e1061813babd70006983200b7d57e/backend/app/api/v2/services/pretraining.py#L65, deployment, etc. seem to either assume the existing params have not changed or look up each field in Supabase separately and throw an "X with this name does not exist" error.

Any thoughts on the best consistent way to deal with repeated start_x calls @mryave @nason ?
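
To make the options concrete, one possible consistent policy is: a repeated start_x call with the same name and the same params returns the existing ID, while the same name with different params raises. A rough sketch with an in-memory dict standing in for the platform's storage; JobRegistry, DuplicateNameError and the UUID-based IDs are all hypothetical and not taken from the platform code.

```python
import uuid
from typing import Any, Dict, Tuple


class DuplicateNameError(ValueError):
    """Raised when a name is reused with different params."""


class JobRegistry:
    """Hypothetical stand-in for a service that stores jobs by name."""

    def __init__(self) -> None:
        # name -> (job_id, params used when the job was first started)
        self._jobs: Dict[str, Tuple[str, Dict[str, Any]]] = {}

    def start(self, name: str, **params: Any) -> str:
        existing = self._jobs.get(name)
        if existing is None:
            # First call with this name: create the job and remember its params.
            job_id = str(uuid.uuid4())
            self._jobs[name] = (job_id, dict(params))
            return job_id

        job_id, stored_params = existing
        if stored_params != params:
            # Same name, different params: refuse instead of overwriting.
            raise DuplicateNameError(
                f"'{name}' already exists with different params (id={job_id})"
            )
        # Same name, same params: treat the call as idempotent.
        return job_id
```

The alternative discussed above (always overwrite) would replace the raise with an update of the stored params; whichever is chosen, applying it in one shared helper would keep corpus, pretraining, deployment and evaluations consistent.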

arcee-ai / arcee-python

Add Evaluation Support to Arcee Python SDK #84
