VHellendoorn / Code-LMs

Guide to using pre-trained large language models of source code
MIT License

How to evaluate the perplexity of PolyCoder/CodeParrot? #18

Closed · nforest closed this issue 2 years ago

nforest commented 2 years ago

Hello,

Thank you for the awesome PolyCoder project; I think it really helps with evaluating all the code PTMs. However, I only found a script for evaluating the perplexity of Codex, so it's hard to reproduce the perplexity benchmark comparison between Codex, PolyCoder, and the other PTMs. Would it be possible for you to add a PolyCoder/CodeParrot perplexity evaluation script?

Thanks again, Sen

VHellendoorn commented 2 years ago

Hi! We are actually working on setting up a fork of GPT-NeoX here that contains the modifications we made to the original repository and the scripts needed to reproduce the perplexity and HumanEval results. We also made a fork of the LM Evaluation Harness specifically for the perplexity evaluation. Instructions for replicating some of the other models' results, such as CodeParrot's, are still in the works, but hopefully this is a useful place to start.
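In the meantime, here is a rough sketch of the core computation the harness performs: score a file with the causal LM and normalize the total negative log-likelihood in different ways (per model token, per byte, per Pygments token). This is not the harness's exact code, and the file path is just a placeholder:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "lvwerra/codeparrot"  # any causal LM on the Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

code = open("example.py").read()  # placeholder evaluation file
enc = tokenizer(code, return_tensors="pt")
n_tokens = enc.input_ids.size(1)

with torch.no_grad():
    # With labels=input_ids, the model shifts internally and returns the
    # mean cross-entropy (in nats) over the n_tokens - 1 predicted positions.
    loss = model(enc.input_ids, labels=enc.input_ids).loss.item()

total_nll = loss * (n_tokens - 1)  # total negative log-likelihood in nats
print(f"model-token perplexity: {math.exp(loss):.3f}")
print(f"byte perplexity: {math.exp(total_nll / len(code.encode('utf-8'))):.3f}")
```

Files longer than the model's context window would need to be scored in sliding windows, and the Pygments-token normalization is omitted here; the harness fork takes care of both, which is presumably why its output reports a num_pygments_tokens count.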

nforest commented 2 years ago

Thanks for the pointers! I have evaluated the perplexity of the EleutherAI/gpt-neo-2.7B and lvwerra/codeparrot models. The EleutherAI/gpt-neo-2.7B result matches the one reported in the PolyCoder paper, but the lvwerra/codeparrot result differs slightly from the paper.

Below are the lvwerra/codeparrot perplexity results I got; I hope the information is useful to you.

```json
{
  "results": {
    "code_python": {
      "word_perplexity": 2.947644345529426,
      "byte_perplexity": 1.3086303012893996,
      "bits_per_byte": 0.26898101868702134,
      "num_pygments_tokens": 79653,
      "num_model_tokens": 84167
    },
    "code_c++": {
      "word_perplexity": 8.480947498095638,
      "byte_perplexity": 1.6218670760842977,
      "bits_per_byte": 0.4835780017088815,
      "num_pygments_tokens": 69627,
      "num_model_tokens": 94886
    },
    "code_c#": {
      "word_perplexity": 7.161291858450508,
      "byte_perplexity": 1.466587462575848,
      "bits_per_byte": 0.3829382480087081,
      "num_pygments_tokens": 67306,
      "num_model_tokens": 85839
    },
    "code_c": {
      "word_perplexity": 19.239165781067225,
      "byte_perplexity": 1.8399389638106254,
      "bits_per_byte": 0.6097323992286451,
      "num_pygments_tokens": 54841,
      "num_model_tokens": 99147
    },
    "code_php": {
      "word_perplexity": 19.909012881134213,
      "byte_perplexity": 1.588113142529397,
      "bits_per_byte": 0.4625466087310989,
      "num_pygments_tokens": 45682,
      "num_model_tokens": 86911
    },
    "code_go": {
      "word_perplexity": 10.004139279782196,
      "byte_perplexity": 1.931687199189394,
      "bits_per_byte": 0.6583938175007944,
      "num_pygments_tokens": 79947,
      "num_model_tokens": 95795
    },
    "code_scala": {
      "word_perplexity": 12.90566311187167,
      "byte_perplexity": 1.8225096617714456,
      "bits_per_byte": 0.6002144862701331,
      "num_pygments_tokens": 65756,
      "num_model_tokens": 79058
    },
    "code_java": {
      "word_perplexity": 6.792650761831589,
      "byte_perplexity": 1.4973542265188218,
      "bits_per_byte": 0.4036997017074563,
      "num_pygments_tokens": 65484,
      "num_model_tokens": 81224
    },
    "code_javascript": {
      "word_perplexity": 9.234540763938018,
      "byte_perplexity": 1.8044722998625062,
      "bits_per_byte": 0.5902681943940824,
      "num_pygments_tokens": 54620,
      "num_model_tokens": 61554
    },
    "code_typescript": {
      "word_perplexity": 12.54418652020766,
      "byte_perplexity": 1.8121801214461348,
      "bits_per_byte": 0.5945306074520582,
      "num_pygments_tokens": 55895,
      "num_model_tokens": 66215
    },
    "code_ruby": {
      "word_perplexity": 14.262475649567447,
      "byte_perplexity": 1.8715695028215535,
      "bits_per_byte": 0.6267773851675859,
      "num_pygments_tokens": 46537,
      "num_model_tokens": 58519
    },
    "code_rust": {
      "word_perplexity": 8.679620501082228,
      "byte_perplexity": 1.811500318792854,
      "bits_per_byte": 0.5941554073357859,
      "num_pygments_tokens": 107717,
      "num_model_tokens": 112252
    }
  },
  "versions": {
    "code_python": 0,
    "code_c++": 0,
    "code_c#": 0,
    "code_c": 0,
    "code_php": 0,
    "code_go": 0,
    "code_scala": 0,
    "code_java": 0,
    "code_javascript": 0,
    "code_typescript": 0,
    "code_ruby": 0,
    "code_rust": 0
  }
}
```
gpt2 (pretrained=lvwerra/codeparrot), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1
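One note on reading these metrics, in case it helps others: in this output, bits_per_byte appears to be the natural log of byte_perplexity, i.e. the per-byte negative log-likelihood in nats rather than bits. And if word_perplexity is the total NLL normalized by num_pygments_tokens (which the field name suggests, though I have not checked the fork's code), the implied average Pygments-token length falls out of the ratio of the logs. A quick check on the Python numbers above:

```python
import math

# Figures copied from the code_python block above.
word_ppl = 2.947644345529426
byte_ppl = 1.3086303012893996
bits_per_byte = 0.26898101868702134

# bits_per_byte matches ln(byte_perplexity): per-byte NLL in nats.
assert math.isclose(math.log(byte_ppl), bits_per_byte, rel_tol=1e-6)

# Assuming word_perplexity = exp(total NLL / num_pygments_tokens), the
# average Pygments-token length in bytes is the ratio of the two log terms.
print(f"{math.log(word_ppl) / bits_per_byte:.2f} bytes per Pygments token")
# -> 4.02, a plausible average token length for Python source
```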

VHellendoorn commented 2 years ago

Hi, thanks for sharing these results! We've replicated them on our end and are getting the same numbers: GPT-Neo and the like remain unchanged, but the CodeParrot perplexities are generally lower than what we reported in the paper. We are not entirely sure why; perhaps something changed when we finalized the evaluation for release. In any case, we will update the results in the paper and acknowledge your contribution. Thanks so much!

On a tangent, it looks like the main difference is that CodeParrot now scores a bit better on Python than PolyCoder. That makes a lot of sense to me: training on monolingual data typically comes with a significant boost on that language, as supported by Salesforce's recent work.