```
Difficulty level: introductory, pass@k: {1: 0.17739999999999997}, num of examples: 1000, total_cost: 0
Difficulty level: interview, pass@k: {1: nan}, num of examples: 0, total_cost: 0
Difficulty level: competition, pass@k: {1: nan}, num of examples: 0, total_cost: 0
Number of outputs distributions:
/data2/seonghyeon/repositories/CodeChain/src/evaluate.py:111: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
  print(describe(num_outputs))
/data2/seonghyeon/anaconda3/envs/codefeedbackbench/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1418: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
  sk = skew(a, axis, bias=bias)
/data2/seonghyeon/anaconda3/envs/codefeedbackbench/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1419: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
  kurt = kurtosis(a, axis, bias=bias)
DescribeResult(nobs=1000, minmax=(5, 5), mean=5.0, variance=0.0, skewness=nan, kurtosis=nan)
pass@1 0.17739999999999997 nan nan
```
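(The RuntimeWarnings above are, as far as I can tell, harmless: `describe()` is being called on a constant array, since every problem has exactly 5 outputs, so the variance is 0 and skewness/kurtosis come out as nan. A minimal reproduction outside `evaluate.py`:)

```python
import numpy as np
from scipy.stats import describe

# Every problem has exactly 5 generated outputs, so the per-problem
# output counts form a constant array.
num_outputs = np.full(1000, 5)

# On constant data the higher moments cancel out, so SciPy warns about
# catastrophic cancellation and reports skewness/kurtosis as nan
# (exact behaviour depends on the SciPy version), matching the
# DescribeResult in the log above.
print(describe(num_outputs))
# DescribeResult(nobs=1000, minmax=(5, 5), mean=5.0, variance=0.0,
#                skewness=nan, kurtosis=nan)
```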
As the log above shows, I am trying to reproduce the results reported in your paper, but the APPS introductory pass@1 I get is about 0.18, which is quite different from the reported 0.26. With greedy decoding the score increases to 0.213, but that is still far from 0.26.
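For clarity, the two decoding settings I compared look roughly like the sketch below. This is not CodeChain's actual generation code, and the prompt and sampling hyperparameters are placeholders; it is only meant to show what I mean by "sampling" vs. "greedy decoding".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only -- prompt format and hyperparameters are placeholders,
# not the settings used by CodeChain's generation script.
model_name = "WizardLMTeam/WizardCoder-15B-V1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "<APPS problem statement + instruction go here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Setting 1: sampling, 5 completions per problem (matches num_outputs = 5)
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,        # placeholder value
    top_p=0.95,             # placeholder value
    num_return_sequences=5,
    max_new_tokens=1024,
)

# Setting 2: greedy decoding, one deterministic completion per problem
greedy = model.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    max_new_tokens=1024,
)
```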
I am using the checkpoint WizardLMTeam/WizardCoder-15B-V1.0. Could anyone help me reproduce this result?
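In case it helps with debugging, my understanding is that the pass@1 above is the standard unbiased pass@k estimator from the Codex paper, averaged over problems with n = 5 samples each. A quick sketch of that estimator (this helper is mine, not taken from src/evaluate.py):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    passes the tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With 5 samples per problem, a single correct sample contributes 0.2
print(pass_at_k(n=5, c=1, k=1))  # 0.2
```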