pre-release v0.4
pre-release v0.3
If you ran the evaluation with an older version (v0.2), please reevaluate the subjectivity task.
If you used the first public version (v0.1), please reevaluate the subjectivity, belebele, and snli tasks so that your results remain comparable with the benchmark. Be sure to extract leaderboard results (using compile_log_files.py) from the new results, not the older ones.
pre-release v0.2
Fixes for the belebele, snli, and grammarerrorcorrection tasks. The first had a critical bug in its prompt (only the answer choices were shown to the model, without the question or the context). The latter two tasks were using wrong metrics.
Please reevaluate these tasks before submitting your results to the leaderboard.
If the leaderboard doesn't show up (or shows something like "Results dataset integrity solving"), it means the model tournament is being recomputed (~5 hours). This happens every time we fix a crucial bug (so after v0.2 and v0.3).
Welcome to the 🇨🇿 BenCzechMark fork of lm-evaluation-harness. The official readme corresponding to the forked version is here. The main differences of this fork include:
An extra switch for aggregating per-token log-probabilities with an average instead of a sum.
A smart truncation switch, which prevents the task description from being truncated.
The prompt is composed as <description><few_shot_examples><current_sample><current_continuation>. With plain truncation from the left, the description would be cut first, followed by the current_sample. With smart truncation, the prompt is instead shortened first by dropping <few_shot_examples>; if the remaining prefix (<description><few_shot_examples><current_sample>) comes close to 20% of the defined max_input_length, the rightmost truncation from the suffix is made further.
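As a rough illustration of these two switches, here is a minimal Python sketch. It is not the fork's actual implementation; the function names and the exact order of truncation steps are assumptions based on the description above.

# Illustrative sketch only -- not the fork's implementation.
# Mimics normalize_log_probs (average instead of sum) and a
# leave_description-style truncation (few-shot examples are dropped
# before the task description is ever touched).

def aggregate_logprobs(token_logprobs, normalize_log_probs=True):
    """Score a continuation from its per-token log-probabilities."""
    total = sum(token_logprobs)
    if normalize_log_probs:
        return total / max(len(token_logprobs), 1)  # average per token
    return total  # plain sum (default lm-harness behaviour)


def leave_description_truncate(description, few_shot_examples, current_sample, max_input_length):
    """Fit the prompt into max_input_length without cutting the description.

    All arguments are token lists; few_shot_examples is a list of token lists.
    """
    def prompt_length():
        return len(description) + sum(len(e) for e in few_shot_examples) + len(current_sample)

    # 1) drop few-shot examples one by one until the prompt fits
    while few_shot_examples and prompt_length() > max_input_length:
        few_shot_examples = few_shot_examples[:-1]

    # 2) if it still does not fit, truncate current_sample and keep the description intact
    if prompt_length() > max_input_length:
        budget = max(max_input_length - len(description), 0)
        current_sample = current_sample[len(current_sample) - budget:]

    return description, few_shot_examples, current_sample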
To produce per-sample outputs for the leaderboard, run every task with the output_path and log_samples parameters. Your output paths across all tasks (as matched by a glob expression) are then aggregated by the compile_log_files.py script (see the leaderboard instructions).
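For example, with the output paths used in the commands below, a glob pattern of the following form matches one results directory per task. This is only a hypothetical illustration of the pattern idea; check the leaderboard instructions for the exact arguments compile_log_files.py expects.

# Hypothetical illustration of the glob pattern -- adjust it to whatever
# naming scheme you used for --output_path; compile_log_files.py aggregates
# paths matched by a pattern of this kind (see the leaderboard instructions).
import glob

paths = sorted(glob.glob("results_hf/eval_csmpt7b_benczechmark_*"))
print(paths)  # one output path per evaluated task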
We ran all experiments on a Slurm cluster (Karolina, 8x 40GB A100 GPUs per node). For a comprehensive overview, check out the job scripts, starting with jobs/run_benchmark.sh. We follow the standard lm-harness options, together with our custom functionality described in the introduction. We did not use chat_templates.
To execute a single 🇨🇿 BenCzechMark task, you can run one of the following:
# CSMPT7b on 1 GPU
python -m accelerate.commands.launch \
--dynamo_backend=inductor \
-m lm_eval \
--model hf \
--model_args pretrained=BUT-FIT/csmpt7b,\
dtype=bfloat16,max_length=2048,\
truncation=True,normalize_log_probs=True,\
trust_remote_code=True,truncate_strategy=leave_description \
--tasks benczechmark_cs_sqad32 \
--batch_size 2 \
--output_path results_hf/eval_csmpt7b_benczechmark_cs_sqad32_chat_none_trunc_leave_description \
--log_samples \
--verbosity DEBUG \
--num_fewshot 3
# Mistral Nemo on 8 GPUs, 1 node, using HF backend and pipeline parallelism
python -m lm_eval \
--model hf \
--model_args pretrained=mistralai/Mistral-Nemo-Instruct-2407,\
dtype=bfloat16,parallelize=True,max_length=2048,\
truncation=True,normalize_log_probs=True,\
trust_remote_code=True,truncate_strategy=leave_description \
--tasks benczechmark_sentiment_mall \
--batch_size 8 \
--output_path results_hf/eval_mistral_nemo_instruct_benczechmark_sentiment_mall_chat_none_trunc_leave_description \
--log_samples \
--verbosity DEBUG \
--num_fewshot 3
# Mixtral on 8 GPUs, 1 node, using VLLM backend and tensor parallelism
python -m lm_eval \
--model vllm \
--model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,\
tensor_parallel_size=8,dtype=bfloat16,\
gpu_memory_utilization=0.8,max_length=2048,\
normalize_log_probs=True,trust_remote_code=True,\
truncate_strategy=leave_description \
--tasks benczechmark_czechnews \
--batch_size auto:4 \
--output_path results_hf/eval_mixtralM_instruct_benczechmark_czechnews_chat_none_trunc_leave_description \
--log_samples \
--verbosity DEBUG \
--num_fewshot 3
# See jobs/scripts/models/eval_L_vllm_master.sh for multinode evaluation with VLLM.
Notes & Tips:
Our custom model_args are truncate_strategy (options are None / "leave_description") and normalize_log_probs (True triggers averaging).
batch_size auto sometimes causes CUDA OOM errors. We usually ran all tasks with auto, and those which failed were rerun with a fixed batch size.