Open sfc-gh-hhan opened 1 month ago
The discrepancy you observed might be due to two main factors: the recent update to GPT-4 and the ongoing updates to our benchmarks, which mean the current test cases may differ from those used in previous evaluations.
I'm using this command to evaluate Pass@1:
After rerunning: