DCGM / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Changelog / Things to be aware of!

Introduction to 🇨🇿 BenCzechMark Fork

Welcome to the 🇨🇿 BenCzechMark fork of lm-evaluation-harness. The official README corresponding to the forked version is here. The main differences of this fork include:

Obtaining Results For Leaderboard Submission

To execute a single 🇨🇿 BenCzechMark task, run one of the following:

# CSMPT7b on 1 GPU
python -m accelerate.commands.launch \
    --dynamo_backend=inductor \
    -m lm_eval \
    --model hf \
    --model_args pretrained=BUT-FIT/csmpt7b,\
dtype=bfloat16,max_length=2048,\
truncation=True,normalize_log_probs=True,\
trust_remote_code=True,truncate_strategy=leave_description \
    --tasks benczechmark_cs_sqad32 \
    --batch_size 2 \
    --output_path results_hf/eval_csmpt7b_benczechmark_cs_sqad32_chat_none_trunc_leave_description \
    --log_samples \
    --verbosity DEBUG \
    --num_fewshot 3
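The invocations below differ only in the task name and output path, so a small wrapper can loop the same model settings over several tasks. A minimal sketch (the task list and output naming here are illustrative, not prescribed by the harness; `echo` prints each command instead of running it -- drop it to execute):

```shell
#!/usr/bin/env sh
# Shared model arguments for all tasks (same settings as the command above).
MODEL_ARGS="pretrained=BUT-FIT/csmpt7b,dtype=bfloat16,max_length=2048,\
truncation=True,normalize_log_probs=True,\
trust_remote_code=True,truncate_strategy=leave_description"

# Illustrative task list -- substitute the BenCzechMark tasks you need.
for TASK in benczechmark_cs_sqad32 benczechmark_sentiment_mall; do
    echo python -m lm_eval \
        --model hf \
        --model_args "$MODEL_ARGS" \
        --tasks "$TASK" \
        --batch_size 2 \
        --output_path "results_hf/eval_csmpt7b_${TASK}" \
        --log_samples \
        --num_fewshot 3
done
```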

# Mistral Nemo on 8 GPUs, 1 node, using HF backend and pipeline parallelism
python -m lm_eval \
    --model hf \
    --model_args pretrained=mistralai/Mistral-Nemo-Instruct-2407,\
dtype=bfloat16,parallelize=True,max_length=2048,\
truncation=True,normalize_log_probs=True,\
trust_remote_code=True,truncate_strategy=leave_description \
    --tasks benczechmark_sentiment_mall \
    --batch_size 8 \
    --output_path results_hf/eval_mistral_nemo_instruct_benczechmark_sentiment_mall_chat_none_trunc_leave_description \
    --log_samples \
    --verbosity DEBUG \
    --num_fewshot 3

# Mixtral on 8 GPUs, 1 node, using VLLM backend and tensor parallelism
python -m lm_eval \
    --model vllm \
    --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,\
tensor_parallel_size=8,dtype=bfloat16,\
gpu_memory_utilization=0.8,max_length=2048,\
normalize_log_probs=True,trust_remote_code=True,\
truncate_strategy=leave_description \
    --tasks benczechmark_czechnews \
    --batch_size auto:4 \
    --output_path results_hf/eval_mixtralM_instruct_benczechmark_czechnews_chat_none_trunc_leave_description \
    --log_samples \
    --verbosity DEBUG \
    --num_fewshot 3

# See jobs/scripts/models/eval_L_vllm_master.sh for multinode evaluation with VLLM.
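Each run writes a JSON file under the directory given by `--output_path`, whose `"results"` key maps task names to metric dicts. A minimal sketch of summarising those metrics (the exact file layout and metric key names vary by harness version and task; the inline `sample` dict stands in for a real results file):

```python
import json

# Stand-in for json.load(open(path)) on a real results file; metric names
# like "acc,none" are illustrative and depend on the task's config.
sample = {
    "results": {
        "benczechmark_cs_sqad32": {"acc,none": 0.61, "acc_stderr,none": 0.01}
    }
}

def summarise(results: dict) -> list[str]:
    """Return one tab-separated line per (task, metric), skipping stderr keys."""
    lines = []
    for task, metrics in results["results"].items():
        for name, value in metrics.items():
            if not name.endswith("_stderr,none"):
                lines.append(f"{task}\t{name}\t{value}")
    return lines

for line in summarise(sample):
    print(line)
```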

Notes & Tips: