GoogleCloudPlatform / generative-ai

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview
Apache License 2.0

Model tuning does not work #254

Open sidoncloud opened 10 months ago

sidoncloud commented 10 months ago

Like most of the code uploaded by Google developers, your model tuning code that uses the Stack Overflow data fails miserably, giving the errors below.

{
  "summary": "Found 7 errors in your file. See 'errors' field for specific details.\nValidated 4000 examples for tokenization. Found 7 examples where either 'input_text' or 'output_text' exceeds the model token limits. See 'tokenization_issues' field for some specific examples.\nValidated 1000 examples for RAI. Found 43 examples that has RAI issues. See 'rai_issues' field for some specific examples.\n",
  "max_user_input_token_length": 8177,
  "tokenization_issues": [
    "Row: 122. Token limit exceeded for 'input_text' [tokens: 15851|limit: 8192] or 'output_text' [tokens: 24|limit: 1024]",
    "Row: 362. Token limit exceeded for 'input_text' [tokens: 13474|limit: 8192] or 'output_text' [tokens: 19|limit: 1024]",
    "Row: 391. Token limit exceeded for 'input_text' [tokens: 10643|limit: 8192] or 'output_text' [tokens: 34|limit: 1024]",
    "Row: 528. Token limit exceeded for 'input_text' [tokens: 9351|limit: 8192] or 'output_text' [tokens: 17|limit: 1024]",
    "Row: 840. Token limit exceeded for 'input_text' [tokens: 16309|limit: 8192] or 'output_text' [tokens: 33|limit: 1024]",
    "Row: 868. Token limit exceeded for 'input_text' [tokens: 20337|limit: 8192] or 'output_text' [tokens: 51|limit: 1024]",
    "Row: 1535. Token limit exceeded for 'input_text' [tokens: 8969|limit: 8192] or 'output_text' [tokens: 26|limit: 1024]"
  ],
  "rai_issues": [
    "Row: 15. RAI violation. High scores for categories Finance",
    "Row: 46. RAI violation. High scores for categories Finance",
    "Row: 275. RAI violation. High scores for categories Finance",
    "Row: 401. RAI violation. High scores for categories Finance",
    "Row: 444. RAI violation. High scores for categories Health",
    "Row: 503. RAI violation. High scores for categories Finance",
    "Row: 558. RAI violation. High scores for categories Finance",
    "Row: 571. RAI violation. High scores for categories Health",
    "Row: 848. RAI violation. High scores for categories Finance",
    "Row: 934. RAI violation. High scores for categories Finance",
    "... there are more cases ..."
  ],
  "errors": [
    "Row: 122. exceeds token limit",
    "Row: 362. exceeds token limit",
    "Row: 391. exceeds token limit",
    "Row: 528. exceeds token limit",
    "Row: 840. exceeds token limit",
    "Row: 868. exceeds token limit",
    "Row: 1535. exceeds token limit"
  ],
  "max_user_output_token_length": 79
}

fmichaelobrien commented 10 months ago

Understood. I am new to this repo but an LLM enthusiast. I can try to reproduce and triage this if you share the specific use case and the exact code you ran. Here to help.

paulav6 commented 9 months ago

I faced similar rai_issues even with private data. It flagged samples that contained a person's name or asked about going to a specific bank website. The errors went away once I removed those samples from my JSONL file. So, unless these examples are crucial, you could try removing them.
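
For anyone who wants to script that removal, here is a minimal sketch in Python. It assumes the validation report is saved as JSON in the shape shown in the first comment, and that every flagged message starts with "Row: N.". The function names, file paths, and the `first_row` parameter are hypothetical. The thread does not say whether row numbers start at 0 or 1, so check a flagged row by hand before trusting the output. Note also that the report truncates its lists ("... there are more cases ..."), so several validate-and-filter passes may be needed.

```python
import json
import re

# Matches the "Row: 122." prefix used by the messages in the report above.
ROW_PATTERN = re.compile(r"^Row: (\d+)\.")


def flagged_rows(report):
    """Collect every row number mentioned in the report's issue lists."""
    rows = set()
    for field in ("errors", "tokenization_issues", "rai_issues"):
        for message in report.get(field, []):
            match = ROW_PATTERN.match(message)
            if match:  # skips entries like "... there are more cases ..."
                rows.add(int(match.group(1)))
    return rows


def drop_rows(lines, rows_to_drop, first_row=0):
    """Return the lines whose row number is not flagged.

    first_row is an assumption: set it to 1 if the report counts from 1.
    """
    return [
        line
        for number, line in enumerate(lines, start=first_row)
        if number not in rows_to_drop
    ]


def filter_jsonl(report_path, in_path, out_path, first_row=0):
    """Write a copy of the JSONL dataset without the flagged rows."""
    with open(report_path, encoding="utf-8") as f:
        rows_to_drop = flagged_rows(json.load(f))
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    kept = drop_rows(lines, rows_to_drop, first_row)
    with open(out_path, encoding="utf-8") as f_out:
        pass  # placeholder replaced below
    with open(out_path, "w", encoding="utf-8") as f_out:
        f_out.writelines(kept)
    return len(lines) - len(kept)
```

A call might look like `filter_jsonl("report.json", "tune_data.jsonl", "tune_data_clean.jsonl")`, where all three paths are placeholders. Dropping rows is the bluntest fix; for the token-limit rows, shortening `input_text` would keep the example, but that needs the model's tokenizer and is not covered here.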