evalplus / repoqa

RepoQA: Evaluating Long-Context Code Understanding
https://evalplus.github.io/repoqa.html
Apache License 2.0

[Tracking] Evaluating models on base dataset using 16k context #25

Closed ganler closed 6 months ago

ganler commented 6 months ago

OSS model 🤗

CodeLlama

🤔 marks models trained with an 8~16k context; their config.json may need to be modified.

DeepSeekCoder

Llama 3

CodeQwen

Qwen1.5

CodeGemma

Mistral

Starcoder2
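For the 🤔-marked models above, raising the usable context typically means editing the model's Hugging Face config.json to add linear RoPE scaling. A minimal sketch (the `extend_context` helper is hypothetical; the `rope_scaling` / `max_position_embeddings` fields follow the Llama-style HF config format):

```python
import json

def extend_context(cfg: dict, target_len: int) -> dict:
    """Add linear RoPE scaling to a HF config dict so it covers target_len tokens."""
    orig = cfg.get("max_position_embeddings", 8192)
    if target_len > orig:
        # Linear scaling: positions are divided by this factor at inference time.
        cfg["rope_scaling"] = {"type": "linear", "factor": target_len / orig}
        cfg["max_position_embeddings"] = target_len
    return cfg

# Example: stretch an 8k-trained config to 16k.
cfg = {"max_position_embeddings": 8192}
cfg = extend_context(cfg, 16384)
print(json.dumps(cfg, indent=2))
```

Whether quality holds up at the extended length is model-dependent; scaling the config only makes the longer window mechanically accepted.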

Private model 💲💰💸

ganler commented 6 months ago

It seems vLLM enforces a strict context-size limit for Llama-3 (trained on 8k max): any request beyond 8k is rejected. The DeepSeek series is fine, and its context size can be extended as shown in the CodeQwen report.
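When serving past the trained window, vLLM also needs the limit raised explicitly at launch. A sketch using vLLM's OpenAI-compatible server (the model name and scaling factor here are illustrative; pick the factor from the model's original max_position_embeddings):

```shell
# Serve with a 16k window; linear RoPE scaling with factor 4
# assumes a 4k-trained base (adjust to the actual model config).
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-coder-6.7b-instruct \
  --max-model-len 16384 \
  --rope-scaling '{"type": "linear", "factor": 4.0}'
```

Without `--max-model-len`, vLLM caps requests at the length declared in the model's config, which is why 8k-trained Llama-3 rejects longer prompts.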


ganler commented 6 months ago

*CodeQwen TypeScript results are missing; will catch up on that soon.

ganler commented 6 months ago

CodeQwen results updated.

ganler commented 6 months ago

Running databricks/dbrx-instruct as well.

ganler commented 6 months ago

databricks/dbrx-instruct produces empty output all the time. I think I will skip it then.

ganler commented 6 months ago

Added bigcode/starcoder2-instruct-15b-v0.1.

ganler commented 6 months ago

Got rate limited by Gemini Pro and Claude....