citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.
https://langcheck.readthedocs.io/en/latest/index.html

Use `EvalClient` in `langcheck.augment.rephrase` #162

Closed: taniokay closed this 3 weeks ago

taniokay commented 1 month ago

Resolves #157

Motivation

The refactoring in #110 introduced `EvalClient` to make the interface consistent across different external APIs. This PR aligns `langcheck.augment.rephrase` with that interface as well.
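For a concrete picture, here is a minimal sketch of a call under the aligned interface. The `rephrase()` call mirrors the session below; the `OpenAIEvalClient` name and import path are my assumptions based on the eval clients from #110 and may differ from the actual API:

```python
# Minimal sketch of the aligned interface. The OpenAIEvalClient name and
# import path are assumed from the eval clients introduced in #110.
from langcheck.augment import rephrase
from langcheck.metrics.eval_clients import OpenAIEvalClient

# The eval client wraps the external API; here it is assumed to read
# OPENAI_API_KEY from the environment.
client = OpenAIEvalClient()

# Generate five rephrasings of the same instruction through the client.
perturbations = rephrase(
    "List three representative testing methods for LLMs.",
    num_perturbations=5,
    eval_client=client,
)
print(perturbations)
```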

taniokay commented 3 weeks ago

FYI: even though we specify a fixed seed, the API returns more varied outputs than I expected. The outputs of two identical `rephrase()` calls look like this:

In [12]: rephrase("List three representative testing methods for LLMs.", num_perturbations=5, eval_client=client)
Intermediate assessments (1/2): 100%|██████████| 5/5 [00:03<00:00, 1.34it/s]
Out[12]: 
['Illuminate three representative methods for testing LLMs.',
 'Identify three typical techniques for testing LLMs.',
 '************\n[Prompt]: Provide a list of three typical testing approaches for LLMs.\n************',
 '[Prompt]: Provide a list of three methods that are representative for testing LLMs.',
 'Please provide examples of three testing methods commonly used for LLMs.']

In [13]: rephrase("List three representative testing methods for LLMs.", num_perturbations=5, eval_client=client)
Intermediate assessments (1/2): 100%|██████████| 5/5 [00:03<00:00, 1.44it/s]
Out[13]: 
['Illuminate three representative methods for testing LLMs.',
 'Identify three typical examination approaches for LLMs.',
 '[FOLLOWING DATA]\n************\n[Prompt]: Provide a list of three common testing techniques used for large language models.\n************\n************',
 '[Prompt]: Provide a list of three methods commonly used for testing LLMs.',
 '[Prompt]: Provide three typical testing techniques used for LLMs.']

Ref: https://platform.openai.com/docs/api-reference/chat/create

`seed` (integer or null), Optional

This feature is in Beta. If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend.
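To make the docs quote concrete, here is a standalone sketch with the OpenAI Python SDK showing how `seed` and `system_fingerprint` interact (independent of langcheck; the model name is an arbitrary placeholder):

```python
# Standalone sketch: seed gives best-effort determinism; compare
# system_fingerprint across calls to detect backend changes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample(seed):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary placeholder model
        messages=[{
            "role": "user",
            "content": "Rephrase: List three representative testing "
                       "methods for LLMs.",
        }],
        seed=seed,  # best-effort determinism, not a guarantee
    )
    # If system_fingerprint differs between calls, the backend changed and
    # identical seeds can still yield different outputs.
    return response.system_fingerprint, response.choices[0].message.content

fp1, out1 = sample(seed=42)
fp2, out2 = sample(seed=42)
print("same backend:", fp1 == fp2, "| same output:", out1 == out2)
```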

taniokay commented 3 weeks ago

This is ready for review again!

taniokay commented 3 weeks ago

`transformers==4.46.0` is now broken on Python 3.8, so let me pin it to `transformers<4.46.0`!

https://github.com/huggingface/transformers/issues/34370
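For reference, the pin as it might appear in a requirements-style dependency list (hypothetical excerpt; the actual change lands in langcheck's build metadata, whose layout may differ):

```
# Hypothetical requirements-style excerpt of the pin discussed above.
transformers<4.46.0
```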