bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

add odex and mconala datasets #45

Closed zorazrw closed 8 months ago

zorazrw commented 1 year ago

Added the ODEX and MCoNaLa datasets as tasks. The processing and evaluation functions follow the original ODEX and MCoNaLa code repositories.

loubnabnl commented 1 year ago

The current scores of this implementation for codegen-2b-mono are:

{
  "odex-en": {
    "pass@1": 0.3679726651480638,
    "pass@2": 0.40785644553949135,
    "pass@5": 0.44750070323076746,
    "pass@10": 0.4676729704105087
  },
  "config": {
    "model": "Salesforce/codegen-2B-mono",
    "temperature": 0.2,
    "n_samples": 50
  }
}
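
For reference, pass@k values like the ones above are usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021). The sketch below assumes that estimator is the one in use here. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not the harness's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c,
    so the result is 1.0 whenever fewer than k samples failed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical per-problem pass counts with n_samples = 50 (not real ODEX data).
example_counts = [0, 5, 50, 25]
for k in (1, 2, 5, 10):
    print(f"pass@{k}: {mean_pass_at_k(example_counts, n=50, k=k):.4f}")
```

With `k = 1` the estimator reduces to `c / n`, which is why pass@1 at `n_samples = 1` is simply the fraction of problems solved.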

This is higher than the number reported in the ODEX paper because the original implementation does not strip the prompts (see issue https://github.com/zorazrw/odex/issues/5). However, when the prompt is stripped in that implementation, the pass@1 in ~greedy mode (temp 1e-6) is:

Overall Pass@K Scores: 
[pass@1] 0.4100 (439)

In the same ~greedy setting, this implementation gives a pass@1 of:

{
  "odex-en": {
    "pass@1": 0.3712984054669704
  },
  "config": {
    "model": "Salesforce/codegen-2B-mono",
    "temperature": 1e-06,
    "n_samples": 1
  }
}

So there is still a gap of about 0.04 in pass@1 (0.4100 vs. 0.3713) to be investigated.
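
To put the gap in concrete terms, the two pass@1 values can be converted back into solved-problem counts. This is only a rough sketch: it assumes that "(439)" in the original implementation's log is the number of evaluated problems, and that both runs used the same problem set.

```python
# Assumption: "(439)" in the original implementation's output is the problem count.
NUM_PROBLEMS = 439


def solved_count(pass_at_1: float, num_problems: int = NUM_PROBLEMS) -> int:
    """With one greedy sample per problem, pass@1 is solved / total."""
    return round(pass_at_1 * num_problems)


reference = solved_count(0.4100)              # original ODEX code, prompt stripped
harness = solved_count(0.3712984054669704)    # this implementation
gap = reference - harness

print(reference, harness, gap)  # prints: 180 163 17
```

Under those assumptions the difference is 17 problems, a set small enough to diff by comparing the per-problem results of the two runs.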