citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.
https://langcheck.readthedocs.io/en/latest/index.html
MIT License
186 stars 17 forks source link

Versioning of eval prompts #144

Closed yosukehigashi closed 2 months ago

yosukehigashi commented 2 months ago

I think it can be helpful to support using older eval prompts, so that users can choose to use previous versions easily if they want to.

Since toxicity is the only metric that we've updated the prompt for so far, this PR just adds eval_prompt_version as an argument to the toxicity metric for now.

Note: I also updated the v2 toxicity prompt for Prometheus to properly reflect that the output scores are no longer 1-5.

yosukehigashi commented 2 months ago

@liwii @kennysong @conan1024hao any thoughts on this approach to versioning eval prompts / metrics? Some things I considered are:

  1. is it backwards compatible: it adds a new param to the toxicity metric, but a default value is set for it so it won't break users' workflows
  2. is it easy to switch between versions: should be pretty easy, just requires changing the eval_prompt_version from e.g. "v2" to "v1"
  3. does it introduce clutter: I don't really like that we'll end up having a whole bunch of prompts in the prompts/ dir now, but IMO it's not as bad as introducing a ton of clutter in actual code. There's a bit of added clutter by toxicity_assessment_to_score supporting multiple versions, but this shouldn't change all that often so hopefully the added clutter will be minimal.
  4. is it easy to find relevant artifacts: it requires the user to manually find the corresponding prompt file in the prompts/ dir, so that's not great. But other than the prompt, everything is contained in the metric functions themselves.
  5. are the version names meaningful: I thought of tying the version names to something more meaningful like a date or the langcheck version, but I ended up going with v1, v2, for now since IMO it's better with respect to (2) and (4)
kennysong commented 2 months ago

I guess we'd need to do this eventually, and we should do it now since the Toxicity metric is frequently used in Lens? Do we plan to show both Toxicity v1 and v2 in Lens btw?

(Your points 1-5 make sense to me)

yosukehigashi commented 2 months ago

I'm not quite sure what we'll do in Lens yet, but while we figure it out I was planning to just continue using v1 under the hood - this allows us to cut a new LangCheck release (which is long overdue and prob something we should do soon to include the nltk upgrade) without changing things for Lens users

kennysong commented 2 months ago

Makes sense! Let's make an issue to discuss metric versioning in Lens separately then.

liwii commented 2 months ago

Thanks for the changes!!

Overall the changes look good to me! (I slightly prefer the naming like toxicity/v1.j2 so that it is easier to understand the prompts belong to the same group, but it is just a misc point)

yosukehigashi commented 2 months ago

Thanks for the changes!!

Overall the changes look good to me! (I slightly prefer the naming like toxicity/v1.j2 so that it is easier to understand the prompts belong to the same group, but it is just a misc point)

Thanks for the review! Yeah agreed that structure would look cleaner if we do that for all metrics, but I thought it would look a bit odd if just toxicity is a directory and everything else is a file. So I think I'll keep it as is for now, but we can always change it later!