Closed yosukehigashi closed 2 months ago
@liwii @kennysong @conan1024hao any thoughts on this approach to versioning eval prompts / metrics? Some things I considered are:
toxicity
metric, but a default value is set for it so it won't break users' workflowseval_prompt_version
from e.g. "v2"
to "v1"
prompts/
dir now, but IMO it's not as bad as introducing a ton of clutter in actual code. There's a bit of added clutter by toxicity_assessment_to_score
supporting multiple versions, but this shouldn't change all that often so hopefully the added clutter will be minimal.prompts/
dir, so that's not great. But other than the prompt, everything is contained in the metric functions themselves.I guess we'd need to do this eventually, and we should do it now since the Toxicity metric is frequently used in Lens? Do we plan to show both Toxicity v1 and v2 in Lens btw?
(Your points 1-5 make sense to me)
I'm not quite sure what we'll do in Lens yet, but while we figure it out I was planning to just continue using v1 under the hood - this allows us to cut a new LangCheck release (which is long overdue and prob something we should do soon to include the nltk upgrade) without changing things for Lens users
Makes sense! Let's make an issue to discuss metric versioning in Lens separately then.
Thanks for the changes!!
Overall the changes look good to me! (I slightly prefer the naming like toxicity/v1.j2
so that it is easier to understand the prompts belong to the same group, but it is just a misc point)
Thanks for the changes!!
Overall the changes look good to me! (I slightly prefer the naming like
toxicity/v1.j2
so that it is easier to understand the prompts belong to the same group, but it is just a misc point)
Thanks for the review! Yeah agreed that structure would look cleaner if we do that for all metrics, but I thought it would look a bit odd if just toxicity is a directory and everything else is a file. So I think I'll keep it as is for now, but we can always change it later!
I think it can be helpful to support using older eval prompts, so that users can choose to use previous versions easily if they want to.
Since toxicity is the only metric that we've updated the prompt for so far, this PR just adds
eval_prompt_version
as an argument to thetoxicity
metric for now.Note: I also updated the v2 toxicity prompt for Prometheus to properly reflect that the output scores are no longer 1-5.