Worked on some analysis of the two BigCode models (`BigCode/gpt_345_python_safe_license_v2` and `BigCode/gpt_345_python_any_license`) as well as CodeGen (`Salesforce/codegen-350M-mono`).
First, I took 5'000 samples from both the safe (`BigCode/python_safe_license`) and any-license dataset (`BigCode/python_any_license`). For each model and dataset I computed the average sample loss. Since the CodeGen model uses a different tokenizer, the losses are not directly comparable; however, the ratio between the two datasets might give some interesting insights:
| model_name | avg_loss_any | avg_loss_safe | loss_ratio |
|---|---|---|---|
| gpt_345_python_safe_license_v2 | 1.061546 | 0.945958 | 1.122191 |
| gpt_345_python_any_license | 0.996761 | 1.009733 | 0.987153 |
| codegen-350M-mono | 0.969186 | 0.990381 | 0.978599 |
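For reference, here is a minimal sketch of how these numbers could be reproduced (the `content` column name, the `train` split, and the truncation settings are assumptions on my part; the model and dataset names are the ones from the table):

```python
# Rough sketch (not the exact evaluation script): average per-sample loss
# of a causal LM on a code dataset. The "content" column, "train" split,
# and truncation length are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_sample_loss(model_name, dataset_name, n_samples=5000, max_length=1024):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    samples = load_dataset(dataset_name, split="train").select(range(n_samples))

    losses = []
    with torch.no_grad():
        for sample in samples:
            input_ids = tokenizer(
                sample["content"], return_tensors="pt",
                truncation=True, max_length=max_length,
            ).input_ids
            # With labels=input_ids the model returns the mean
            # next-token cross-entropy over the sample.
            losses.append(model(input_ids, labels=input_ids).loss.item())
    return sum(losses) / len(losses)

# loss_ratio column: avg loss on the any-license set over the safe set
loss_any = avg_sample_loss("Salesforce/codegen-350M-mono", "BigCode/python_any_license")
loss_safe = avg_sample_loss("Salesforce/codegen-350M-mono", "BigCode/python_safe_license")
print(loss_any / loss_safe)
```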
Observation 1: The `any` model performs similarly on both datasets, but the `safe` model shows a clear discrepancy. This is likely due to the fact that the `safe` dataset is a subset of the `any` dataset.
Observation 2: The CodeGen model behaves much more similarly to the `any` model. This might give some indication that the CodeGen model was trained on more licenses than the `safe` model.
**`safe` and `any` model**

Looked a bit closer at samples where the loss of the `any` model was significantly lower than that of the `safe` model. What was striking is that there were a lot of short code snippets and a lot of solutions to exercises or coding challenges. The following is an example:
```python
'''
Problem 56 | Merge Intervals
https://leetcode.com/problems/merge-intervals/
'''
from typing import List  # needed for the List annotation

class Solution:
    def merge(self, intervals: List[List[int]]) -> List[List[int]]:
        # Sort by interval start, then merge overlapping intervals
        intervals = sorted(intervals, key=lambda x: x[0])
        output = []
        cur = intervals[0]
        for i in range(1, len(intervals)):
            if intervals[i][0] <= cur[1]:
                cur[1] = max(intervals[i][1], cur[1])
            else:
                output.append(cur)
                cur = intervals[i]
        output.append(cur)
        return output
```
A quick pattern-matching pass showed that 10.2% of samples contained the word `solution`. It seemed that a lot of the code snippets with a license at the beginning were GPL code, so another search showed that 15.2% of snippets contained the word `license` at least once and 11.5% contained either `GNU` or `GPL`. So this seems to be quite a large fraction of GPL code.
**Improving the `safe` model**

The samples where the `safe` model performs badly are short, self-contained snippets that are similar to HumanEval but different from normal codebases. Maybe we can train on CodeContests (used in AlphaCode), which has LeetCode-style problems and their Python solutions. There is also CodeSearchNet, where the examples are all function implementations with or without a docstring. But we probably already have those samples in our data, as it is also from GitHub.
As discussed in our 23.08 meeting, we can investigate whether CodeGen was really trained on safe-license data only. We can evaluate the loss of CodeGen on a subset of the python-safe-license and python-all-license datasets and see if there are significant differences compared to the BigCode model trained on safe licenses only.