RECITATION finishReason Causing Content Generation Stops in Google Models

gbaptista commented 1 week ago

Some Google models stop generating content partway through and return finishReason = RECITATION.

According to the docs:

RECITATION: The token generation was stopped as the response was flagged for unauthorized citations.

gbaptista commented 1 week ago

An easy way to reproduce it is to send this prompt, which returns the response below:

Give the first page of the first chapter of Harry Potter.

{
  "candidates":[
    {
      "finishReason":"RECITATION",
      "safetyRatings":[
        {
          "category":"HARM_CATEGORY_HATE_SPEECH",
          "probability":"NEGLIGIBLE",
          "probabilityScore":0.31806138,
          "severity":"HARM_SEVERITY_NEGLIGIBLE",
          "severityScore":0.13039611
        },
        {
          "category":"HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability":"NEGLIGIBLE",
          "probabilityScore":0.13764834,
          "severity":"HARM_SEVERITY_NEGLIGIBLE",
          "severityScore":0.0248928
        },
        {
          "category":"HARM_CATEGORY_HARASSMENT",
          "probability":"NEGLIGIBLE",
          "probabilityScore":0.44049937,
          "severity":"HARM_SEVERITY_NEGLIGIBLE",
          "severityScore":0.17050801
        },
        {
          "category":"HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability":"NEGLIGIBLE",
          "probabilityScore":0.24653332,
          "severity":"HARM_SEVERITY_LOW",
          "severityScore":0.20914645
        }
      ],
      "citationMetadata":{
        "citations":[
          {
            "startIndex":268,
            "endIndex":417,
            "uri":"https://www.lisarivero.com/2011/06/24/plain-and-fancy-words/"
          },
          {
            "startIndex":302,
            "endIndex":581,
            "uri":"https://thefriendlyeditor.com/2012/03/09/rowling-hook-page-one/"
          }
        ]
      }
    }
  ],
  "usageMetadata":{
    "promptTokenCount":12,
    "candidatesTokenCount":97,
    "totalTokenCount":109
  }
}

Of course, this is probably the expected result for this prompt, with Google trying to avoid generating copyrighted content. The issue is that there are too many false positives, which halt generation for many legitimate prompts.
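
Until this is handled upstream, a client can at least detect the case rather than treat it as an empty answer. Below is a minimal sketch in Python, assuming the response has already been parsed into a dict shaped like the JSON above. The helper name `recitation_report` and the keys of its result are hypothetical and not part of any library. Note that the flagged candidate above has no `content` field, so the sketch checks for its presence rather than assuming it exists.

```python
from typing import Any


def recitation_report(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one entry per candidate that stopped with finishReason RECITATION."""
    report = []
    for index, candidate in enumerate(response.get("candidates", [])):
        if candidate.get("finishReason") != "RECITATION":
            continue
        # citationMetadata may be absent, so fall back to empty containers.
        citations = candidate.get("citationMetadata", {}).get("citations", [])
        report.append({
            "candidate": index,
            "has_content": "content" in candidate,
            "uris": [c["uri"] for c in citations if "uri" in c],
        })
    return report


# Trimmed copy of the response above (safetyRatings and usageMetadata omitted).
sample = {
    "candidates": [
        {
            "finishReason": "RECITATION",
            "citationMetadata": {
                "citations": [
                    {
                        "startIndex": 268,
                        "endIndex": 417,
                        "uri": "https://www.lisarivero.com/2011/06/24/plain-and-fancy-words/",
                    },
                    {
                        "startIndex": 302,
                        "endIndex": 581,
                        "uri": "https://thefriendlyeditor.com/2012/03/09/rowling-hook-page-one/",
                    },
                ]
            },
        }
    ]
}

for entry in recitation_report(sample):
    print(entry["candidate"], entry["has_content"], entry["uris"])
```

The cited URIs are kept in the report so that false positives can be inspected case by case.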

maayanorner commented 1 day ago

I have the same issue when using Gemini for summarization. Naturally, a summary of copyrighted content can be flagged as "copyrighted content"; however, we have explicit permission to use it.

gbaptista / gemini-ai

RECITATION finishReason Causing Content Generation Stops in Google Models #21