Adding an answer judge benchmark

Kipok / NeMo-Skills

A pipeline to improve skills of large language models

https://kipok.github.io/NeMo-Skills/

Apache License 2.0


Adding an answer judge benchmark #187

Closed by Kipok 3 weeks ago

Kipok commented 3 weeks ago

Example command

ns eval \
    --cluster=local \
    --server_type=openai \
    --model=gpt-4o \
    --server_address=https://api.openai.com/v1 \
    --benchmarks=answer-judge:0 \
    --output_dir=/workspace/NeMo-Skills/test-judgement

The benchmark is mostly empty for now, but let's keep populating it with the difficult examples we find, and eventually we will have a good benchmark.
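To make the idea concrete, here is a minimal sketch of how an answer-judge example could be represented and scored. Everything in it is an assumption for illustration: the `JudgeExample` fields, the `judge_accuracy` helper, and the toy data are hypothetical and are not the actual NeMo-Skills data schema or evaluation code. An exact-match baseline stands in for the model judge to show why hard cases (equivalent answers written differently) are the ones worth collecting.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgeExample:
    # Hypothetical fields; the real benchmark format may differ.
    problem: str
    predicted_answer: str
    expected_answer: str
    expected_judgement: bool  # True if the two answers should count as equivalent


def judge_accuracy(
    examples: List[JudgeExample],
    judge: Callable[[JudgeExample], bool],
) -> float:
    """Fraction of examples where the judge agrees with the reference judgement."""
    if not examples:
        return 0.0
    correct = sum(judge(ex) == ex.expected_judgement for ex in examples)
    return correct / len(examples)


def exact_match_judge(ex: JudgeExample) -> bool:
    """Naive baseline: answers are equivalent only if the strings match."""
    return ex.predicted_answer.strip() == ex.expected_answer.strip()


# Toy data (made up for this sketch, not taken from the benchmark).
EXAMPLES = [
    JudgeExample("What is 1/2 as a decimal?", "0.5", "1/2", True),
    JudgeExample("Solve x + 1 = 3.", "x = 2", "2", True),
    JudgeExample("What is 2 + 2?", "5", "4", False),
    JudgeExample("What is 3 * 3?", "9", "9", True),
]

if __name__ == "__main__":
    print(judge_accuracy(EXAMPLES, exact_match_judge))
```

The exact-match baseline gets the two differently formatted but equivalent answers wrong, which is the kind of case an LLM judge is expected to handle and the benchmark is meant to capture.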