Open rizkidotme opened 4 months ago
Hi, thanks for your request. I've been diligently working to evaluate Gemini, but I'm encountering significant usability issues with their API. It frequently generates errors that, surprisingly, even Google hasn't resolved for quite some time. I also tested the API on AI Studio, but keep running into problems with truncation, which shortens the responses too much, leaving them incomplete and hard to evaluate.
Do you perhaps know of a simpler way to assess Gemini? Alternatively, you could run the evaluation yourself and submit the result file following the instructions in the markdown; that would give a precise measure of the performance. If that's feasible, it would be very helpful. Regardless, I'll keep trying their API and see if I can get better results.
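For anyone hitting the same intermittent errors and truncation, here is a minimal sketch of how the calls might be wrapped using the `google-generativeai` Python SDK. The model name, retry count, and token budget are placeholder assumptions, not the harness's actual settings:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-1.5-pro")

def generate_with_retry(prompt: str, retries: int = 3) -> str:
    """Call the API with simple retries for transient server errors and a
    generous output budget to reduce the truncation seen in AI Studio."""
    config = genai.GenerationConfig(temperature=0.0, max_output_tokens=4096)
    for attempt in range(retries):
        try:
            response = model.generate_content(prompt, generation_config=config)
            return response.text
        except Exception:  # the API intermittently raises server-side errors
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```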
Hello, I've managed to figure out a way to evaluate the Gemini models. Please review the results below; I will add them to our leaderboard soon.
| Model | Total | Distraction | Redefination | Shortcut | Commonsense | Cornercase | Complex | Codesense |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-1.5-flash | 0.3143 | 0.3 | 0.45 | 0.3 | 0.4 | 0.2 | 0.15 | 0.4 |
| gemini-1.5-pro | 0.4286 | 0.25 | 0.45 | 0.4 | 0.6 | 0.45 | 0.25 | 0.6 |
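As a sanity check on the aggregation (an assumption on my part; the harness may weight categories differently), the reported Total values match an unweighted mean of the seven category scores:

```python
# Rough check: Total appears to be the unweighted mean of the seven category scores.
results = {
    "gemini-1.5-flash": {"Distraction": 0.3, "Redefination": 0.45, "Shortcut": 0.3,
                         "Commonsense": 0.4, "Cornercase": 0.2, "Complex": 0.15,
                         "Codesense": 0.4},
    "gemini-1.5-pro":   {"Distraction": 0.25, "Redefination": 0.45, "Shortcut": 0.4,
                         "Commonsense": 0.6, "Cornercase": 0.45, "Complex": 0.25,
                         "Codesense": 0.6},
}
for model, scores in results.items():
    total = sum(scores.values()) / len(scores)
    print(f"{model}: Total = {total:.4f}")  # prints 0.3143 and 0.4286
```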
Thank you for giving it a try. I'm sorry for the late response; sadly, I don't know a better way to evaluate it, so it's fortunate you figured one out.
The results make sense given other relevant benchmarks for these tasks. I requested it because it has a long context window and context caching.
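For reference, a minimal sketch of what context caching looks like through the `google-generativeai` SDK. The model version, TTL, and document here are placeholder assumptions, and the cached content must exceed a minimum token count, so a short file would be rejected in practice:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Cache a long shared context once; caching requires a versioned model
# and a minimum number of cached tokens, so the document must be large.
long_document = open("big_corpus.txt").read()  # placeholder document
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="shared-eval-context",
    contents=[long_document],
    ttl=datetime.timedelta(minutes=30),
)

# Subsequent requests reuse the cached tokens instead of resending them.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Answer a question about the cached document.")
print(response.text)
```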
Model introduction
Models created by Google that excel at a range of tasks: natively multimodal, with the longest context window available on the market (up to a two-million-token window). Their coding capabilities don't seem top-notch compared to leading models on many benchmarks, but I'm still curious how they perform on diverse tests.
Model URL (Optional)
https://deepmind.google/technologies/gemini/pro/
Additional information (Optional)
No response
Decontamination
From the technical report:

> HumanEval leakage: HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conservative filtering heuristics. An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pre-training on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. We found that this sharp increase persisted even when examples were embedded in extraneous formats (e.g. JSON, HTML). We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage. The Natural2Code benchmark, which we announced and used in the evaluation of Gemini 1.0 series of models, was created to fill this gap. It follows the exact same format of HumanEval but with a different set of prompts and tests.
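As a complement to the in-house held-out functions suggested above, one can run a rough leakage check on any data used for continued training. A minimal sketch; the word-level 13-gram window is an arbitrary assumption on my part, not the report's method:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting changes don't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def word_ngrams(text: str, n: int = 13):
    """Yield word-level n-grams of a benchmark prompt."""
    words = normalize(text).split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def looks_contaminated(benchmark_prompt: str, training_doc: str, n: int = 13) -> bool:
    """Flag a prompt if any of its n-grams appears verbatim in a training document.
    This only partially covers the 'embedded in extraneous formats' case, since
    JSON/HTML wrapping can still break n-gram boundaries."""
    doc = normalize(training_doc)
    return any(gram in doc for gram in word_ngrams(benchmark_prompt, n))
```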
Author
No
Data
No
Security
Integrity