Worked on some analysis of the two BigCode models (`BigCode/gpt_345_python_safe_license_v2` and `BigCode/gpt_345_python_any_license`) as well as CodeGen (`Salesforce/codegen-350M-mono`).
First, I took 5'000 samples from both the safe (`BigCode/python_safe_license`) and any-license dataset (`BigCode/python_any_license`). For each model and dataset I computed the average sample loss. Since the CodeGen model uses a different tokenizer, the losses are not directly comparable; however, the ratio between the two datasets might give some interesting insights:
| model_name | avg_loss_any | avg_loss_safe | loss_ratio |
|---|---|---|---|
| gpt_345_python_safe_license_v2 | 1.061546 | 0.945958 | 1.122191 |
| gpt_345_python_any_license | 0.996761 | 1.009733 | 0.987153 |
| codegen-350M-mono | 0.969186 | 0.990381 | 0.978599 |
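For reference, here is a minimal sketch of how these numbers could be reproduced (the `content` column name, the `train` split, and the truncation settings are assumptions on my part; the model and dataset names are the ones from the table):

```python
# Rough sketch (not the exact evaluation script): average per-sample loss
# of a causal LM on a code dataset. The "content" column, "train" split,
# and truncation length are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_sample_loss(model_name, dataset_name, n_samples=5000, max_length=1024):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    samples = load_dataset(dataset_name, split="train").select(range(n_samples))

    losses = []
    with torch.no_grad():
        for sample in samples:
            input_ids = tokenizer(
                sample["content"], return_tensors="pt",
                truncation=True, max_length=max_length,
            ).input_ids
            # With labels=input_ids the model returns the mean
            # next-token cross-entropy over the sample.
            losses.append(model(input_ids, labels=input_ids).loss.item())
    return sum(losses) / len(losses)

# loss_ratio column: avg loss on the any-license set over the safe set
loss_any = avg_sample_loss("Salesforce/codegen-350M-mono", "BigCode/python_any_license")
loss_safe = avg_sample_loss("Salesforce/codegen-350M-mono", "BigCode/python_safe_license")
print(loss_any / loss_safe)
```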
Observation 1: The `any` model performs similarly on both datasets, but the `safe` model shows a clear discrepancy. This is likely due to the fact that the `safe` dataset is a subset of the `any` dataset.
Observation 2: The CodeGen model behaves much more similarly to the `any` model. This might give some indication that the CodeGen model was trained on more licenses than the `safe` model.
**`safe` and `any` model**

Looked a bit closer at samples where the loss of the `any` model was significantly lower than that of the `safe` model. What was striking is that there were a lot of short code snippets and a lot of solutions to exercises or coding challenges. The following is an example:
```python
'''
Problem 56 | Merge Intervals
https://leetcode.com/problems/merge-intervals/
'''
from typing import List  # needed for the List annotation

class Solution:
    def merge(self, intervals: List[List[int]]) -> List[List[int]]:
        # Sort by interval start, then merge overlapping intervals
        intervals = sorted(intervals, key=lambda x: x[0])
        output = []
        cur = intervals[0]
        for i in range(1, len(intervals)):
            if intervals[i][0] <= cur[1]:
                cur[1] = max(intervals[i][1], cur[1])
            else:
                output.append(cur)
                cur = intervals[i]
        output.append(cur)
        return output
```
A quick pattern-matching pass showed that 10.2% of samples contained the word `solution`. It seemed that a lot of the code snippets with a license at the beginning were GPL code, so another search showed that 15.2% of snippets contained the word `license` at least once and 11.5% contained either `GNU` or `GPL`. So this seems to be quite a large fraction of GPL code.
**Improving the `safe` model**

The samples where the `safe` model performs badly are short, self-contained snippets that are similar to HumanEval but different from normal codebases. Maybe we can train on CodeContests (used in AlphaCode), which has LeetCode-style problems and their Python solutions. There is also CodeSearchNet, where the examples are all function implementations with or without a docstring. But we probably already have those samples in our data, as it is also from GitHub.
As discussed in our 23.08 meeting, we can investigate whether CodeGen was really trained on safe-license data only. We can evaluate the loss of CodeGen on a subset of the python-safe-license and python-all-license datasets and see if there are significant differences compared to the BigCode model trained on safe licenses only.