bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

Evaluate CodeGen on safe and all-license dataset #23

Closed harm-devries closed 1 year ago

harm-devries commented 2 years ago

As discussed in our 23.08 meeting, we can investigate if CodeGen is really trained on safe license data only. We can evaluate the loss of CodeGen on a subset of the python-safe-license dataset and python-all-license dataset and see if there are significant differences with the BigCode model trained on safe-license only.

lvwerra commented 2 years ago

Worked on some analysis of the two BigCode models ("BigCode/gpt_345_python_safe_license_v2" and "BigCode/gpt_345_python_any_license") as well as CodeGen ("Salesforce/codegen-350M-mono").

Loss ratio

First, I took 5,000 samples each from the safe ("BigCode/python_safe_license") and any-license ("BigCode/python_any_license") datasets. For each model and dataset I computed the average sample loss. Since the CodeGen model uses a different tokenizer, the losses are not directly comparable; however, the ratio between the two datasets might give some interesting insights:

| model_name | avg_loss_any | avg_loss_safe | loss_ratio |
|---|---|---|---|
| gpt_345_python_safe_license_v2 | 1.061546 | 0.945958 | 1.122191 |
| gpt_345_python_any_license | 0.996761 | 1.009733 | 0.987153 |
| codegen-350M-mono | 0.969186 | 0.990381 | 0.978599 |
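For clarity, the `loss_ratio` column is just the average any-license loss divided by the average safe-license loss for each model. A minimal sketch of that computation (the `loss_ratio` helper is illustrative; in the real analysis each value would come from a forward pass per sample):

```python
# Sketch of the loss-ratio computation. The per-sample losses would come
# from evaluating each model on each dataset subset; here the helper just
# aggregates them.
from statistics import mean

def loss_ratio(losses_any, losses_safe):
    """Ratio of the average sample loss on the any-license subset
    to the average sample loss on the safe-license subset."""
    return mean(losses_any) / mean(losses_safe)

# The averages from the table above reproduce the reported ratio,
# e.g. for gpt_345_python_safe_license_v2:
ratio = 1.061546 / 0.945958
print(round(ratio, 6))  # 1.122191
```

A ratio close to 1 means the model is equally surprised by both subsets; a ratio well above 1 means the safe subset is noticeably easier for that model.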

Observation 1: The any model performs similarly on both datasets, but the safe model shows a clear discrepancy. This is likely because the safe dataset is a subset of the any dataset.

Observation 2: The CodeGen model behaves much more like the any model. This might give some indication that the CodeGen model was trained on more licenses than the safe model.

Difference between the safe and any model

Looked a bit closer at samples where the loss of the any model was significantly lower than that of the safe model. What was striking is that there were a lot of short code snippets and a lot of solutions to exercises or coding challenges. The following is an example:

```python
'''
Problem 56 | Merge Intervals
https://leetcode.com/problems/merge-intervals/
'''
from typing import List

class Solution:
    def merge(self, intervals: List[List[int]]) -> List[List[int]]:
        # Sort by interval start, then sweep and merge overlapping intervals
        intervals = sorted(intervals, key=lambda x: x[0])
        output = []
        cur = intervals[0]
        for i in range(1, len(intervals)):
            if intervals[i][0] <= cur[1]:
                cur[1] = max(intervals[i][1], cur[1])
            else:
                output.append(cur)
                cur = intervals[i]
        output.append(cur)
        return output
```
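The selection step described above (picking samples where the any model's loss is clearly lower) can be sketched as follows. The margin and the toy loss values are illustrative assumptions, not values from the actual analysis:

```python
# Hypothetical sketch of the sample-selection step: given per-sample losses
# under the "safe" and "any" models, keep the samples where the any model
# does clearly better. The margin of 0.1 is an arbitrary illustrative choice.
def samples_where_any_wins(samples, losses_safe, losses_any, margin=0.1):
    """Return samples whose loss under the any model is lower than
    under the safe model by at least `margin`."""
    return [
        s
        for s, l_safe, l_any in zip(samples, losses_safe, losses_any)
        if l_safe - l_any >= margin
    ]

# Toy example: only the second sample shows a clear gap.
samples = ["def f(): pass", "class Solution: ...", "print('hi')"]
picked = samples_where_any_wins(samples, [1.0, 1.5, 0.9], [0.95, 1.2, 0.85])
print(picked)  # ['class Solution: ...']
```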

A quick pattern match showed that 10.2% of samples contained the word "solution". It also seemed that a lot of the code snippets with a license header were GPL code, so another search showed that 15.2% of snippets contained the word "license" at least once and 11.5% contained either "GNU" or "GPL". So GPL code seems to make up quite a large fraction.
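Keyword statistics like these can be computed with a simple substring search over the samples. A minimal sketch (the sample texts are toy stand-ins for the dataset, and whether the original search was case-sensitive is not stated in the thread, so case handling is a parameter here):

```python
# Sketch of the keyword statistics: the fraction of samples containing at
# least one of the given words. Case handling is configurable because the
# original search's case sensitivity is not specified.
def keyword_fraction(samples, words, case_sensitive=True):
    """Fraction of samples containing at least one of `words`."""
    if not case_sensitive:
        words = [w.lower() for w in words]
    hits = 0
    for text in samples:
        haystack = text if case_sensitive else text.lower()
        if any(w in haystack for w in words):
            hits += 1
    return hits / len(samples)

# Toy data: 2 of 4 samples mention GNU or GPL.
samples = [
    "# GNU General Public License",
    "def solve(): return 42",
    "GPL-3.0 licensed module",
    "print('hello')",
]
print(keyword_fraction(samples, ["GNU", "GPL"]))  # 0.5
```

At dataset scale the same loop would run over the 5,000 evaluated samples rather than a handful of strings.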

Takeaways

loubnabnl commented 2 years ago

Maybe we can train on CodeContests, used in AlphaCode, which has LeetCode-style problems and their Python solutions. There's also CodeSearchNet, where the examples are all function implementations with or without a docstring. But we probably already have these samples in our data, as it is also sourced from GitHub.