CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0

Why does GPT-CC score significantly lower than Codex? #73

Closed · BitcoinNLPer closed this issue 2 years ago

BitcoinNLPer commented 2 years ago

These are the Codex results: [image]

The following is the GPT-CC result: [image]

Thanks
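
The numbers compared in these screenshots are pass@k scores on HumanEval. For reference, here is a minimal sketch of the unbiased pass@k estimator described in the Codex paper (Chen et al., 2021); the helper name and the numpy usage are illustrative, not the repository's own code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn per
    problem and c of them passed the unit tests. Computed as a
    running product to avoid evaluating huge binomials."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 13 passing -> pass@1 estimate
print(pass_at_k(n=200, c=13, k=1))  # 0.065
```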

Symbolk commented 2 years ago

Also curious. As far as I know, the HumanEval dataset contains 164 problems, and according to the latest results in the README, the model does not pass even one of them!

[Image: HumanEval results from the README]
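
For context on how that number is produced: HumanEval scores a model by generating completions for each of the 164 prompts and executing them against bundled unit tests. Below is a minimal sketch using OpenAI's human-eval harness together with a Hugging Face causal LM; the checkpoint name is a placeholder assumption, and greedy decoding with one sample per task only approximates pass@1:

```python
# Minimal sketch: score a checkpoint on HumanEval with OpenAI's
# human-eval harness (https://github.com/openai/human-eval).
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

MODEL = "flax-community/gpt-code-clippy-125M"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def complete(prompt: str) -> str:
    """Greedy-decode one completion for a HumanEval prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the generated continuation, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

problems = read_problems()  # 164 tasks: HumanEval/0 ... HumanEval/163
samples = [
    {"task_id": tid, "completion": complete(problems[tid]["prompt"])}
    for tid in problems
]
write_jsonl("samples.jsonl", samples)
# Then run the sandboxed unit tests from the shell:
#   evaluate_functional_correctness samples.jsonl
```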

taisazero commented 2 years ago

Yeah, that's correct. We discovered issues with the pre-training corpus data. We fixed the issue and released a new pre-training corpus, and we are in the process of processing the dataset further and pre-training a new GPT-CC.

In the meantime, check out these awesome models: https://huggingface.co/spaces/codeparrot/code-generation-models