Leo: we should design a new metric, considering different aspects of correctness:
1) similarity score like CodeBLEU,
2) passing compilation,
3) passing linking,
4) passing execution using unit testing,
Each aspect is worth some number of points; the final metric is the sum of the points earned across all aspects.
This aggregated metric can capture gradual improvement in translation quality, rather than only a 0 vs. 1 score for whether the final execution passes.
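
A minimal sketch of how such an aggregated score might be computed. The names (`TranslationResult`, `aggregate_score`, `WEIGHTS`), the equal point weights, the gating of later stages on earlier ones, and the partial credit for unit tests are all assumptions made for illustration; none of them were specified in the discussion.

```python
from dataclasses import dataclass


@dataclass
class TranslationResult:
    """Outcome of evaluating one translated program (hypothetical structure)."""
    similarity: float   # CodeBLEU-like similarity, assumed normalized to [0, 1]
    compiles: bool      # did the translated code compile?
    links: bool         # did it link?
    tests_passed: int   # number of unit tests that passed
    tests_total: int    # number of unit tests that were run


# Assumed point values per aspect; the real weights are still to be decided.
WEIGHTS = {"similarity": 1.0, "compile": 1.0, "link": 1.0, "execute": 1.0}


def aggregate_score(r: TranslationResult, weights: dict = WEIGHTS) -> float:
    """Sum the points earned across the four aspects.

    Assumption: a later stage only earns points if the earlier one passed,
    since code that does not compile cannot link or run.
    """
    points = weights["similarity"] * r.similarity
    if r.compiles:
        points += weights["compile"]
        if r.links:
            points += weights["link"]
            if r.tests_total > 0:
                # Assumption: partial credit proportional to tests passed.
                points += weights["execute"] * r.tests_passed / r.tests_total
    return points
```

With these assumed weights, a translation that compiles but fails to link still scores higher than one that does not compile at all, which is the kind of progression the metric is meant to show.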