Eladlev / AutoPrompt

A framework for prompt tuning using Intent-based Prompt Calibration
Apache License 2.0

At least one label specified must be in y_true #36

Closed · alena-m closed 4 months ago

alena-m commented 4 months ago

Hi, with the latest changes I got a new error when running run_generation_pipeline.py:

Traceback (most recent call last):
  File "path_to_repo/AutoPrompt/run_generation_pipeline_alena.py", line 64, in <module>
    best_prompt = ranker_pipeline.run_pipeline(opt.num_ranker_steps)
  File "path_to_repo/AutoPrompt/optimization_pipeline.py", line 272, in run_pipeline
    stop_criteria = self.step()
  File "path_to_repo/AutoPrompt/optimization_pipeline.py", line 252, in step
    self.eval.add_history(self.cur_prompt, self.task_description)
  File "path_to_repo/AutoPrompt/eval/evaluator.py", line 115, in add_history
    conf_matrix = confusion_matrix(self.dataset['annotation'],
  File "path_to_env/AutoPrompt/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "path_to_env/AutoPrompt/lib/python3.10/site-packages/sklearn/metrics/_classification.py", line 340, in confusion_matrix
    raise ValueError("At least one label specified must be in y_true")
ValueError: At least one label specified must be in y_true

config_ranking.yml and config_generation.yml are not modified; config_default.yml is:

use_wandb: False
dataset:
    name: 'dataset'
    records_path: null
    initial_dataset: ''
    label_schema: ["Yes", "No"]
    max_samples: 5
    semantic_sampling: False # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

# annotator:
#     method : 'argilla'
#     config:
#         api_url: ''
#         api_key: 'admin.apikey'
#         workspace: 'admin'
#         time_interval: 5

annotator:
    method: 'llm'
    config:
        llm:
            type: 'OpenAI'
            name: 'gpt-3.5-turbo-0613'
        instruction: 'Assess whether the text contains a harmful topic. 
            Answer Yes if it does and No otherwise.'
        num_workers: 2
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

predictor:
    method : 'llm'
    config:
        llm:
            type: 'OpenAI'
            name: 'gpt-3.5-turbo-0613'
#            async_params:
#                retry_interval: 10
#                max_retries: 2
            model_kwargs: {"seed": 220}
        num_workers: 2
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1  # change to >1 if you want to include multiple samples in one prompt
        mode: 'prediction'

meta_prompts:
    folder: 'prompts/meta_prompts_classification'
    num_err_prompt: 1  # Number of error examples per sample in the prompt generation
    num_err_samples: 2 # Number of error examples per sample in the sample generation
    history_length: 4 # Number of sample in the meta-prompt history
    num_generated_samples: 10 # Number of generated samples at each iteration
    num_initialize_samples: 10 # Number of generated samples at iteration 0, in zero-shot case
    samples_generation_batch: 10 # Number of samples generated in one call to the LLM
    num_workers: 5 #Number of parallel workers
    warmup: 4 # Number of warmup steps

eval:
    function_name: 'accuracy'
    num_large_errors: 4
    num_boundary_predictions : 0
    error_threshold: 0.5

llm:
    type: 'OpenAI'
    name: 'gpt-3.5-turbo-0613'
    temperature: 0.8

stop_criteria:
    max_usage: 2 #In $ in case of OpenAI models, otherwise number of tokens
    patience: 3 # Number of patience steps
    min_delta: 0.05 # Delta for the improvement definition

I run the command:

python run_generation_pipeline.py \
    --prompt "Write a good and comprehensive movie review about a specific movie." \
    --task_description "Assistant is a large language model that is tasked with writing movie reviews."
Eladlev commented 4 months ago

Hi, I see that you modified the annotator to be an LLM estimator. However, the annotator instruction asks the model to classify 'Yes' or 'No', while the ranker labels are '1','2',...,'5' (see the label_schema in config_ranking). In this case, the annotator produces labels that do not exist in the schema, which results in this error.
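
For reference, sklearn's confusion_matrix raises exactly this error when none of the labels passed via its labels argument appear in y_true; the evaluator presumably passes the ranker's label_schema as labels and the annotation column as y_true, so the mismatch can be reproduced like this (hypothetical values):

from sklearn.metrics import confusion_matrix

# Hypothetical values illustrating the mismatch: the annotator wrote "Yes"/"No"
# into the annotation column, but the ranker's label_schema is '1'..'5'.
annotations = ["Yes", "No", "Yes"]   # y_true (annotator output)
predictions = ["3", "1", "5"]        # y_pred (ranker predictions)
confusion_matrix(annotations, predictions, labels=["1", "2", "3", "4", "5"])
# -> ValueError: At least one label specified must be in y_true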

An example of a valid instruction for your task: "Analyze the following movie review, and provide a score between 1 to 5"

One more thing: I see that you are using gpt-3.5 for the meta-prompts (and the annotator). This will not work well, especially for generation tasks; it's important to use GPT-4/4.5 to get optimal performance.
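
For completeness, here is a sketch of how the annotator block in config_default.yml could look for this setup. The keys mirror the config pasted above; the instruction wording and model choice only follow the suggestions in this comment and are not a prescribed configuration:

annotator:
    method: 'llm'
    config:
        llm:
            type: 'OpenAI'
            name: 'gpt-4'   # gpt-3.5 is discouraged above for annotation and meta-prompts
        instruction: 'Analyze the following movie review, and provide a score between 1 to 5.'
        num_workers: 2
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'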

alena-m commented 4 months ago

Thanks! It works! This example is worth adding to the documentation.

danielliu99 commented 3 months ago

Hi @Eladlev, I am working on a generation task. Following the instructions, in config_generation I have:

annotator:
    method : ''

and in config_default:

annotator:
    method: 'llm'
    config:
        llm:
            type: 'OpenAI'
            name: 'gpt-4'
        instruction:
            "Assess this generated message,
            1. does it align with the intent of user input,
            2. does it rephrase user input,
            If all the answers are Yes, then respond '1', otherwise respond '0'"
        num_workers: 5
        prompt: 'prompts/predictor_completion/prediction.prompt'
        mini_batch_size: 1
        mode: 'annotation'

Is it expected that in dump/generator/dataset.csv the 'prediction' and 'score' columns are all blank? And could you explain the role of the 'annotator' in generation tasks?

Thank you

Eladlev commented 3 months ago

Hi @danielliu99,

  1. At least the 'prediction' column should not be blank (after the iteration is completed).
  2. If you are using an LLM ranker, then you should skip the ranker training phase (since you already have an LLM ranker) and change this line: https://github.com/Eladlev/AutoPrompt/blob/7f373f219aa360cd2de38c6aa700c1dff282d7de/run_generation_pipeline.py#L53 to: generation_config_params.eval.function_params.instruction = ranker_config_params.annotator.config.instruction (see the sketch after this list).
  3. In the generation task there are two phases. In phase 1 we train a ranker prompt (this is the phase that should be skipped in your case); here the role of the annotator is the same as in the classification task. In the second phase we do not use an annotator (this is why the method is left blank). Instead, we modify the score function to be the rank from the ranking model and apply it directly to the model prediction (so there is no need for the annotator part, since its role is now covered by the score function).
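
A minimal sketch of the edit from point 2, using the variable names already present in run_generation_pipeline.py (the exact line number may drift between commits):

# run_generation_pipeline.py: when the annotator is already an LLM ranker,
# reuse its instruction as the eval instruction instead of training a ranker
# prompt (this replaces the linked line):
generation_config_params.eval.function_params.instruction = \
    ranker_config_params.annotator.config.instruction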