GoogleCloudPlatform / generative-ai

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview
Apache License 2.0

[Bug]: evaluation can't handle MAX_TOKENS finish reason #580

Closed hsuyuming closed 5 months ago

hsuyuming commented 6 months ago

File Name

score_and_select_models_rapid_evaluation_sdk.ipynb

What happened?

When I execute "Running evaluation", I get a RuntimeError. After further investigation, this error occurs because when max_output_tokens is set to 128 for Gemini, the model returns MAX_TOKENS as the finish reason once the output token limit is reached. Based on line [1], if the finish reason is not STOP or FINISH_REASON_UNSPECIFIED, the SDK throws an exception. A minimal sketch of the setup that triggers this is shown below the reference link.

[1] https://github.com/googleapis/python-aiplatform/blob/main/vertexai/preview/evaluation/_evaluation.py#L220
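For context, here is a minimal sketch of the kind of setup that triggers the error. The project ID, dataset contents, metric list, and prompt template are placeholders rather than the exact notebook values; the important part is the small max_output_tokens on the Gemini model passed to EvalTask.evaluate.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # placeholder project

# A small output limit makes Gemini stop with finish_reason=MAX_TOKENS on
# longer summaries, which the evaluation SDK currently rejects.
model = GenerativeModel(
    "gemini-1.0-pro",
    generation_config=GenerationConfig(max_output_tokens=128),
)

eval_dataset = pd.DataFrame(
    {
        "content": ["To make a classic spaghetti carbonara, ..."],  # article text
        "reference": ["A short reference summary ..."],
    }
)

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum"],  # placeholder metric list
    experiment="eval-max-tokens-repro",
)

# Raises RuntimeError as soon as any candidate finishes with MAX_TOKENS.
eval_result = summarization_eval_task.evaluate(
    model=model,
    prompt_template="Summarize the following article. Article: {content}. Summary:",
)
```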

Relevant log output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_evaluation.py:221, in _generate_response_from_gemini(model, prompt)
    220 if candidate.finish_reason not in _SUCCESSFUL_FINISH_REASONS:
--> 221     raise RuntimeError(
    222         "The model response did not completed successfully.\n"
    223         f"Finish reason: {candidate.finish_reason}.\n"
    224         f"Finish message: {candidate.finish_message}.\n"
    225         f"Safety ratings: {candidate.safety_ratings}.\n"
    226         "Please adjsut the model safety_settings, or try a different prompt."
    227     )
    228 return response.candidates[0].content.parts[0].text

RuntimeError: The model response did not completed successfully.
Finish reason: 2.
Finish message: .
Safety ratings: [category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
probability_score: 0.121790156
severity: HARM_SEVERITY_NEGLIGIBLE
severity_score: 0.101589449
, category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
probability_score: 0.143427476
severity: HARM_SEVERITY_NEGLIGIBLE
severity_score: 0.124000691
, category: HARM_CATEGORY_HARASSMENT
probability: NEGLIGIBLE
probability_score: 0.175820917
severity: HARM_SEVERITY_NEGLIGIBLE
severity_score: 0.0951811224
, category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: NEGLIGIBLE
probability_score: 0.171615809
severity: HARM_SEVERITY_NEGLIGIBLE
severity_score: 0.0794957
].
Please adjsut the model safety_settings, or try a different prompt.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[23], line 9
      4 for _, (model_name, model) in tqdm(
      5     enumerate(zip(models.keys(), models.values())), total=len(models.keys())
      6 ):
      7     experiment_run_name = f"eval-{model_name}-{run_id}"
----> 9     eval_result = summarization_eval_task.evaluate(
     10         model=model,
     11         prompt_template=str(prompt_template),
     12         experiment_run_name=experiment_run_name,
     13     )
     15     eval_results.append(
     16         (f"Model {model_name}", eval_result.summary_metrics, eval_result.metrics_table)
     17     )

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_eval_tasks.py:329, in EvalTask.evaluate(self, model, prompt_template, experiment_run_name, response_column_name)
    325 if self.experiment and global_experiment_name:
    326     metadata._experiment_tracker.set_experiment(
    327         experiment=self.experiment, backing_tensorboard=False
    328     )
--> 329     eval_result = self._evaluate_with_experiment(
    330         model, prompt_template, experiment_run_name, response_column_name
    331     )
    332     metadata._experiment_tracker.set_experiment(
    333         experiment=global_experiment_name, backing_tensorboard=False
    334     )
    335 elif self.experiment and not global_experiment_name:

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_eval_tasks.py:273, in EvalTask._evaluate_with_experiment(self, model, prompt_template, experiment_run_name, response_column_name)
    271 with vertexai.preview.start_run(experiment_run_name):
    272     self._log_eval_experiment_param(model, prompt_template)
--> 273     eval_result = _evaluation.evaluate(
    274         dataset=self.dataset,
    275         metrics=self.metrics,
    276         model=model,
    277         prompt_template=prompt_template,
    278         content_column_name=self.content_column_name,
    279         reference_column_name=self.reference_column_name,
    280         response_column_name=response_column_name or self.response_column_name,
    281     )
    282     try:
    283         vertexai.preview.log_metrics(eval_result.summary_metrics)

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_evaluation.py:543, in evaluate(dataset, metrics, model, prompt_template, content_column_name, reference_column_name, response_column_name, context_column_name, instruction_column_name)
    538     evaluation_run_config.validate_dataset_column(
    539         constants.Dataset.CONTENT_COLUMN
    540     )
    542 if isinstance(model, generative_models.GenerativeModel):
--> 543     _generate_response_from_gemini_model(model, evaluation_run_config)
    544 elif callable(model):
    545     _generate_response_from_custom_model_fn(model, evaluation_run_config)

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_evaluation.py:255, in _generate_response_from_gemini_model(model, evaluation_run_config)
    241 """Generates responses from Gemini model.
    242 
    243 Args:
    244     model: The Gemini model instance.
    245     evaluation_run_config: Evaluation Run Configurations.
    246 """
    247 if (
    248     constants.Dataset.COMPLETED_PROMPT_COLUMN
    249     in evaluation_run_config.dataset.columns
    250 ):
    251     evaluation_run_config.dataset[
    252         constants.Dataset.MODEL_RESPONSE_COLUMN
    253     ] = evaluation_run_config.dataset[
    254         constants.Dataset.COMPLETED_PROMPT_COLUMN
--> 255     ].apply(
    256         lambda x: _generate_response_from_gemini(model, x)
    257     )
    258 else:
    259     evaluation_run_config.dataset[
    260         constants.Dataset.MODEL_RESPONSE_COLUMN
    261     ] = evaluation_run_config.dataset[
   (...)
    264         lambda x: _generate_response_from_gemini(model, x)
    265     )

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/pandas/core/series.py:4764, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4629 def apply(
   4630     self,
   4631     func: AggFuncType,
   (...)
   4636     **kwargs,
   4637 ) -> DataFrame | Series:
   4638     """
   4639     Invoke function on values of Series.
   4640 
   (...)
   4755     dtype: float64
   4756     """
   4757     return SeriesApply(
   4758         self,
   4759         func,
   4760         convert_dtype=convert_dtype,
   4761         by_row=by_row,
   4762         args=args,
   4763         kwargs=kwargs,
-> 4764     ).apply()

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/pandas/core/apply.py:1209, in SeriesApply.apply(self)
   1206     return self.apply_compat()
   1208 # self.func is Callable
-> 1209 return self.apply_standard()

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/pandas/core/apply.py:1289, in SeriesApply.apply_standard(self)
   1283 # row-wise access
   1284 # apply doesn't have a `na_action` keyword and for backward compat reasons
   1285 # we need to give `na_action="ignore"` for categorical data.
   1286 # TODO: remove the `na_action="ignore"` when that default has been changed in
   1287 #  Categorical (GH51645).
   1288 action = "ignore" if isinstance(obj.dtype, CategoricalDtype) else None
-> 1289 mapped = obj._map_values(
   1290     mapper=curried, na_action=action, convert=self.convert_dtype
   1291 )
   1293 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1294     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1295     #  See also GH#25959 regarding EA support
   1296     return obj._constructor_expanddim(list(mapped), index=obj.index)

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/pandas/core/base.py:921, in IndexOpsMixin._map_values(self, mapper, na_action, convert)
    918 if isinstance(arr, ExtensionArray):
    919     return arr.map(mapper, na_action=na_action)
--> 921 return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/pandas/core/algorithms.py:1814, in map_array(arr, mapper, na_action, convert)
   1812 values = arr.astype(object, copy=False)
   1813 if na_action is None:
-> 1814     return lib.map_infer(values, mapper, convert=convert)
   1815 else:
   1816     return lib.map_infer_mask(
   1817         values, mapper, mask=isna(values).view(np.uint8), convert=convert
   1818     )

File lib.pyx:2926, in pandas._libs.lib.map_infer()

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_evaluation.py:256, in _generate_response_from_gemini_model.<locals>.<lambda>(x)
    241 """Generates responses from Gemini model.
    242 
    243 Args:
    244     model: The Gemini model instance.
    245     evaluation_run_config: Evaluation Run Configurations.
    246 """
    247 if (
    248     constants.Dataset.COMPLETED_PROMPT_COLUMN
    249     in evaluation_run_config.dataset.columns
    250 ):
    251     evaluation_run_config.dataset[
    252         constants.Dataset.MODEL_RESPONSE_COLUMN
    253     ] = evaluation_run_config.dataset[
    254         constants.Dataset.COMPLETED_PROMPT_COLUMN
    255     ].apply(
--> 256         lambda x: _generate_response_from_gemini(model, x)
    257     )
    258 else:
    259     evaluation_run_config.dataset[
    260         constants.Dataset.MODEL_RESPONSE_COLUMN
    261     ] = evaluation_run_config.dataset[
   (...)
    264         lambda x: _generate_response_from_gemini(model, x)
    265     )

File /opt/conda/envs/gemini_evaluation/lib/python3.10/site-packages/vertexai/preview/evaluation/_evaluation.py:230, in _generate_response_from_gemini(model, prompt)
    228         return response.candidates[0].content.parts[0].text
    229 except Exception:
--> 230     raise RuntimeError(
    231         "Failed to generate response candidates from Gemini model.\n"
    232         f"Response: {response}.\n"
    233         f"Prompt: {prompt}."
    234     )

RuntimeError: Failed to generate response candidates from Gemini model.
Response: candidates {
  content {
    role: "model"
    parts {
      text: "## Spaghetti Carbonara Summary\n\nThis recipe outlines the steps to make a classic spaghetti carbonara. Here\'s a summary:\n\n**Ingredients:**\n\n* Spaghetti\n* Pancetta or guanciale\n* Olive oil\n* Eggs\n* Grated Parmesan cheese\n* Black pepper\n* Salt\n\n**Instructions:**\n\n1. Boil a large pot of salted water.\n2. While the water heats, cook pancetta/guanciale in olive oil until crispy.\n3. Remove pancetta and set aside.\n4. Whisk eggs, Parmesan cheese, and black pepper in the same skillet.\n5. Cook pasta"
    }
  }
  finish_reason: MAX_TOKENS
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
    probability_score: 0.121790156
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.101589449
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
    probability_score: 0.143427476
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.124000691
  }
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
    probability_score: 0.175820917
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.0951811224
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
    probability_score: 0.171615809
    severity: HARM_SEVERITY_NEGLIGIBLE
    severity_score: 0.0794957
  }
}
usage_metadata {
  prompt_token_count: 137
  candidates_token_count: 128
  total_token_count: 265
}
.
Prompt: Summarize the following article. Article: To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.. Summary:.


hsuyuming commented 5 months ago

@jsondai Do you have any idea regarding this one?

hsuyuming commented 5 months ago

I can increase the output token limit, but I don't think that is a long-term solution for users. A sketch of that workaround, assuming the model is built with a GenerationConfig as in the notebook (the 1024 value is arbitrary):
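```python
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Workaround only: a larger limit makes MAX_TOKENS less likely, but any
# sufficiently long summary can still trigger the same RuntimeError.
model = GenerativeModel(
    "gemini-1.0-pro",
    generation_config=GenerationConfig(max_output_tokens=1024),
)
```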

jsondai commented 5 months ago

Hi, this is a bug that we plan to fix soon. The Rapid Eval SDK will continue the evaluation process with the partially generated response and a warning message, instead of throwing an error.
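For illustration, one way that behavior could look inside `_generate_response_from_gemini` (a sketch based on the traceback above, not the actual patch):

```python
import logging

_LOGGER = logging.getLogger(__name__)


def _generate_response_from_gemini(model, prompt) -> str:
    """Sketch only: keep a truncated response instead of raising on MAX_TOKENS."""
    response = model.generate_content(prompt)
    candidate = response.candidates[0]
    reason = candidate.finish_reason.name
    if reason == "MAX_TOKENS":
        # Proposed behavior from this thread: warn and continue the evaluation
        # with the partially generated text.
        _LOGGER.warning(
            "Candidate hit MAX_TOKENS; evaluating the partially generated response."
        )
    elif reason not in ("STOP", "FINISH_REASON_UNSPECIFIED"):
        raise RuntimeError(
            "The model response did not complete successfully. "
            f"Finish reason: {candidate.finish_reason}."
        )
    return candidate.content.parts[0].text
```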