Eladlev / AutoPrompt

A framework for prompt tuning using Intent-based Prompt Calibration
Apache License 2.0

how to set the eval part in generation tasks #72

Open tfk12 opened 1 month ago

tfk12 commented 1 month ago

For generation tasks, I guess we should set the eval function_name to 'ranking', and then set function_params accordingly. That's not so obvious.

The following does not work for me:

Can we have a concrete demo for generation tasks? Thanks in advance.

    eval:
        function_name: 'ranking'
        function_params:
            prompt: 'prompts/predictor_completion/prediction_generation.prompt'
            mini_batch_size: 1
            llm:
                type: '...'
            num_workers: 10
            mode: 'verbose'
            label_schema: ["1","2","3","4","5"]

Eladlev commented 1 month ago

Hi, we have a generation example with all the details: https://github.com/Eladlev/AutoPrompt/blob/main/docs/examples.md#generating-movie-reviews-generation-task

Let me know if something is not clear

tfk12 commented 1 month ago

> Hi, we have a generation example with all the details: https://github.com/Eladlev/AutoPrompt/blob/main/docs/examples.md#generating-movie-reviews-generation-task
>
> Let me know if something is not clear

Thank you for your quick response. I had already looked at that part of the docs. However, my main point of confusion is whether I should set the eval function_name to 'ranking' in this generation scenario, and what other settings that would require.

Or should I still leave the eval function_name as 'accuracy'?

Eladlev commented 1 month ago

The short answer is that you should not modify the eval function, because this is done in the run_generation_pipeline.py code itself.

More details on the process: The run_generation_pipeline code consists of two parts:

  1. Optimizing the ranker prompt, where we treat the task as a classification task, so the metric here is still accuracy.
  2. Optimizing the generation prompt. In this part, we take the ranking prompt from the first step and define the score function according to it:

    generation_config_params.eval.function_params = ranker_config_params.predictor.config
    generation_config_params.eval.function_params.instruction = best_prompt['prompt']
    generation_config_params.eval.function_params.label_schema = ranker_config_params.dataset.label_schema
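
To make the wiring concrete, here is a rough sketch of the flow in run_generation_pipeline.py (names and signatures are simplified and the config objects are assumed to be loaded beforehand, so treat it as an illustration rather than the exact code):

    from optimization_pipeline import OptimizationPipeline

    # Phase 1: optimize the ranker prompt as a classification task (eval is still accuracy).
    ranker_pipeline = OptimizationPipeline(ranker_config_params, ranker_task_description, initial_ranker_prompt)
    best_prompt = ranker_pipeline.run_pipeline()  # e.g. {'prompt': ..., 'score': ...}

    # Phase 2: reuse the optimized ranker prompt as the scoring function for the generation prompt.
    generation_config_params.eval.function_params = ranker_config_params.predictor.config
    generation_config_params.eval.function_params.instruction = best_prompt['prompt']
    generation_config_params.eval.function_params.label_schema = ranker_config_params.dataset.label_schema

    generation_pipeline = OptimizationPipeline(generation_config_params, task_description, initial_prompt)
    generation_pipeline.run_pipeline()

So the ranking behavior is injected into eval.function_params programmatically; the eval.function_name you set in the YAML is not what drives it.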

tfk12 commented 1 month ago

And I still got an error like others have reported: a KeyError related to 'samples'.

### Here is my default_config:

    use_wandb: False
    dataset:
        name: 'dataset'
        records_path: null
        initial_dataset: ''
        label_schema: ["Yes", "No"]
        max_samples: 50
        semantic_sampling: False  # Change to True in case you don't have M1. Currently there is an issue with faiss and M1

    annotator:
        method: ''

    predictor:
        method: 'llm'
        config:
            llm:
                type: 'azure'
                name: '...'
            num_workers: 5
            prompt: 'prompts/predictor_completion/prediction_generation.prompt'
            mini_batch_size: 1  # change to >1 if you want to include multiple samples in the one prompt
            mode: 'prediction'

    meta_prompts:
        folder: 'prompts/meta_prompts_generation'
        num_err_prompt: 1  # Number of error examples per sample in the prompt generation
        num_err_samples: 2  # Number of error examples per sample in the sample generation
        history_length: 4  # Number of samples in the meta-prompt history
        num_generated_samples: 5  # Number of generated samples at each iteration
        num_initialize_samples: 5  # Number of generated samples at iteration 0, in zero-shot case
        samples_generation_batch: 5  # Number of samples generated in one call to the LLM
        num_workers: 5  # Number of parallel workers
        warmup: 4  # Number of warmup steps

    eval:
        function_name: 'accuracy'
        num_large_errors: 4
        num_boundary_predictions: 0
        error_threshold: 0.5

    llm:
        type: 'azure'
        name: '...'
        temperature: 0.8

    stop_criteria:
        max_usage: 2  # In $ in case of OpenAI models, otherwise number of tokens
        patience: 10  # Number of patience steps
        min_delta: 0.01  # Delta for the improvement definition

### Following is the logging:

    Starting step 0
    Dataset is empty generating initial samples
    Processing samples:   0%|          | 0/1 [00:00<?, ?it/s]
    LLM_CHAIN {'num_samples': 5, 'task_description': 'Evaluate how well Assistant, a large language model, performs in generating movie reviews in accordance with its designated task.', 'instruction': 'Rate the quality of the generated movie review on a scale of 1 to 5, with 1 being poor and 5 being excellent. Please ensure that the generated review provides a comprehensive and detailed analysis of the specific movie as requested in the user prompt.'}
    Processing samples: 100%|██████████| 1/1 [00:12<00:00, 12.28s/it]
    ╭─────────────────── Traceback (most recent call last) ───────────────────╮
    ...... /optimization_pipeline.py:192 in self.config.meta_prompts.samples_gene
      190 │
      191 │   samples_batches = self.meta_chain.initial_chain.batch_invoke(batch_inputs, self.
    ❱ 192 │   samples_list = [element for sublist in samples_batches for element in sublist['s
      193 │   samples_list = self.dataset.remove_duplicates(samples_list)
      194 │   self.dataset.add(samples_list, 0)
      195 │
    ╰──────────────────────────────────────────────────────────────────────────╯
    KeyError: 'samples'

the command is:

    run_generation_pipeline.py \
        --prompt "Write a good and comprehensive movie review about a specific movie." \
        --task_description "Assistant is a large language model that is tasked with writing movie reviews."

Eladlev commented 1 month ago

It seems like the model was not able to generate samples. I'm guessing that it's either an issue with the connection to your Azure account, or you are using GPT-3.5 as the base LLM (it is too weak to handle the meta-prompts).

Log files are generated in the dump folder that might help you understand the root cause. Feel free to reach out on the Discord channel if you need help with the debugging process.
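
For example, a quick way to see which model the requests actually went out with (the dump path here is an assumption, adjust it to your run's output folder):

    from pathlib import Path

    # Scan the dump-folder logs for the model name and the Azure deployment used in each request.
    for log_file in Path('dump').rglob('*.log'):
        for line in log_file.read_text(errors='ignore').splitlines():
            if "'model':" in line or '/deployments/' in line:
                print(f"{log_file}: {line[:200]}")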

tfk12 commented 1 month ago

> It seems like the model was not able to generate samples. I'm guessing that it's either an issue with the connection to your Azure account, or you are using GPT-3.5 as the base LLM (it is too weak to handle the meta-prompts).
>
> Log files are generated in the dump folder that might help you understand the root cause. Feel free to reach out on the Discord channel if you need help with the debugging process.

Hi, thank you for your help. I've changed the model to GPT-4, and the error remains the same. Am I using the correct prompt files?

The attachment is the log file from the dump folder. Something is weird: I do not have a gpt-3.5-turbo deployment on this endpoint, but the log file mentions gpt-3.5. Does that come from the code scripts?

[screenshot attachment: 捕获 (capture)]
tfk12 commented 1 month ago

log file content:

    2024-07-17 16:22:22,175 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,176 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,187 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,188 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,201 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,202 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,213 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,213 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,225 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,226 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,236 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,237 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,249 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,249 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,260 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,260 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,271 - INFO - Initialize dataset
    2024-07-17 16:22:22,273 - INFO - Load initial dataset from
    2024-07-17 16:22:22,274 - WARNING - Dataset dump not found, initializing from zero
    2024-07-17 16:22:22,275 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,276 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,286 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
    2024-07-17 16:22:22,287 - DEBUG - load_verify_locations cafile='/home/tfk/miniconda3/envs/py311/lib/python3.11/site-packages/certifi/cacert.pem'
    2024-07-17 16:22:22,311 - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'headers': {'api-key': '*****'}, 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': 'Assistant is a large language model designed to generate a task description.\nYou are given a task description phrased as text generation task given some user input. Your task is to rephrase it as a task that suppose to evaluate the quality of the given generative task and how well it adhere to the user input.\n#####\nInput task description: Assistant is a large language model that is tasked with writing movie reviews.\n#####\nRephrased task description:'}], 'model': 'gpt-3.5-turbo', 'n': 1, 'stream': False, 'temperature': 0.8}}
    2024-07-17 16:22:22,316 - DEBUG - connect_tcp.started host='tfkgpt4.openai.azure.com' port=443 local_address=None timeout=None socket_options=None
    2024-07-17 16:22:22,798 - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7fce8edd90d0>
    2024-07-17 16:22:22,798 - DEBUG - start_tls.started ssl_context=<ssl.SSLContext object at 0x7fce8ee68e60> server_hostname='tfkgpt4.openai.azure.com' timeout=None
    2024-07-17 16:22:23,329 - DEBUG - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7fce8edd9d90>
    2024-07-17 16:22:23,330 - DEBUG - send_request_headers.started request=<Request [b'POST']>
    2024-07-17 16:22:23,330 - DEBUG - send_request_headers.complete
    2024-07-17 16:22:23,330 - DEBUG - send_request_body.started request=<Request [b'POST']>
    2024-07-17 16:22:23,330 - DEBUG - send_request_body.complete
    2024-07-17 16:22:23,330 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
    2024-07-17 16:22:27,886 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Cache-Control', b'no-cache, must-revalidate'), (b'Content-Length', b'958'), (b'Content-Type', b'application/json'), (b'access-control-allow-origin', b'*'), (b'apim-request-id', b'7237536d-cf73-4912-8ad8-a1023f18ebd2'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains; preload'), (b'x-content-type-options', b'nosniff'), (b'x-ms-region', b'East US 2'), (b'x-ratelimit-remaining-requests', b'9'), (b'x-ratelimit-remaining-tokens', b'9984'), (b'x-accel-buffering', b'no'), (b'x-ms-rai-invoked', b'true'), (b'x-request-id', b'cdd4ceeb-0a5c-4c1c-8ad5-02c0abc3168f'), (b'x-ms-client-request-id', b'7237536d-cf73-4912-8ad8-a1023f18ebd2'), (b'azureml-model-session', b'd031-20240531174243'), (b'Date', b'Wed, 17 Jul 2024 08:22:27 GMT')])
    2024-07-17 16:22:27,887 - INFO - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
    2024-07-17 16:22:27,887 - DEBUG - receive_response_body.started request=<Request [b'POST']>
    2024-07-17 16:22:27,887 - DEBUG - receive_response_body.complete
    2024-07-17 16:22:27,887 - DEBUG - response_closed.started
    2024-07-17 16:22:27,887 - DEBUG - response_closed.complete
    2024-07-17 16:22:27,887 - DEBUG - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "200 OK"
    2024-07-17 16:22:27,890 - INFO - Task description modified for ranking to: Evaluate the performance of the Assistant, a large language model, in terms of its ability to generate high-quality movie reviews that accurately reflect and adhere to the user's input.
    2024-07-17 16:22:27,898 - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'headers': {'api-key': '*****'}, 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': "Assistant is a large language model designed to generate instructions for every task.\nYou are given a instructions phrased as text generation task.\nYour task is to write an instruction for a classification ranking task that suppose to evaluate the quality of a generated sample given a user prompt for this generative instruction.\nGuidelines:\n1. The classifier labels are ['1', '2', '3', '4', '5']. The result instructions should indicate explicitly that the task is a classification class with the following labels ['1', '2', '3', '4', '5']!\n2. The generated instruction must also evaluate how well the generated sample adhere the user prompt\n#####\nInput generative instruction: Write a good and comprehensive movie review about a specific movie.\n#####\nRephrased classification quality evaluation instruction:"}], 'model': 'gpt-3.5-turbo', 'n': 1, 'stream': False, 'temperature': 0.8}}
    2024-07-17 16:22:27,899 - DEBUG - send_request_headers.started request=<Request [b'POST']>
    2024-07-17 16:22:27,899 - DEBUG - send_request_headers.complete
    2024-07-17 16:22:27,899 - DEBUG - send_request_body.started request=<Request [b'POST']>
    2024-07-17 16:22:27,900 - DEBUG - send_request_body.complete
    2024-07-17 16:22:27,900 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
    2024-07-17 16:22:38,723 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Cache-Control', b'no-cache, must-revalidate'), (b'Content-Length', b'1702'), (b'Content-Type', b'application/json'), (b'access-control-allow-origin', b'*'), (b'apim-request-id', b'6d5661cf-572e-4f1d-9037-9eaf6d0616ec'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains; preload'), (b'x-content-type-options', b'nosniff'), (b'x-ms-region', b'East US 2'), (b'x-ratelimit-remaining-requests', b'8'), (b'x-ratelimit-remaining-tokens', b'9968'), (b'x-accel-buffering', b'no'), (b'x-ms-rai-invoked', b'true'), (b'x-request-id', b'cebf4ca3-af25-425d-a9a1-96dbfd612ddb'), (b'x-ms-client-request-id', b'6d5661cf-572e-4f1d-9037-9eaf6d0616ec'), (b'azureml-model-session', b'd031-20240531174243'), (b'Date', b'Wed, 17 Jul 2024 08:22:38 GMT')])
    2024-07-17 16:22:38,723 - INFO - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
    2024-07-17 16:22:38,723 - DEBUG - receive_response_body.started request=<Request [b'POST']>
    2024-07-17 16:22:38,723 - DEBUG - receive_response_body.complete
    2024-07-17 16:22:38,723 - DEBUG - response_closed.started
    2024-07-17 16:22:38,723 - DEBUG - response_closed.complete
    2024-07-17 16:22:38,724 - DEBUG - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "200 OK"
    2024-07-17 16:22:38,725 - INFO - Initial prompt modified for ranking to: Rate the quality of the provided movie review text based on the following classification labels:

    Please classify the text accordingly, ensuring that the quality of the generated sample is evaluated based on how well it adheres to the given user prompt.
    2024-07-17 16:22:38,725 - INFO - Starting step 0
    2024-07-17 16:22:38,726 - INFO - Dataset is empty generating initial samples
    2024-07-17 16:22:38,735 - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'headers': {'api-key': '*****'}, 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': "Assistant is a large language model designed to generate challenging samples for every task.\nGenerate a list of 5 challenging samples for the following task.\n### Task description:\nEvaluate the performance of the Assistant, a large language model, in terms of its ability to generate high-quality movie reviews that accurately reflect and adhere to the user's input.\n### Task Instruction:\nRate the quality of the provided movie review text based on the following classification labels:\n\n- '1': The review text does not match the given prompt, and does not provide an appropriate or comprehensive review of the specified movie.\n- '2': The review text somewhat aligns with the prompt, but lacks in providing a comprehensive review of the specified movie.\n- '3': The review text aligns with the prompt to an extent and provides an average level of detail about the movie, but there's room for improvement.\n- '4': The review text mostly aligns with the prompt, giving a detailed and satisfactory review of the specified movie.\n- '5': The review text perfectly adheres to the prompt, providing an excellent and comprehensive review of the specified movie. \n\nPlease classify the text accordingly, ensuring that the quality of the generated sample is evaluated based on how well it adheres to the given user prompt.\n###\n### Requirements for Challenging Samples:\n1. The generated samples must be challenging and diverse such that using the task instruction as a prompt will result in the wrong result.\n2. The generated samples must be only from the top two scores! With equal distribution between the two.\n3. The generated samples should be distinct, realistic, and vary significantly to ensure diversity.\n\nIf the task depends both on a context, or a user input and a generated content then the sample content must include all the relevant parts.\n -In this case the sample content structure should be as follows:\n 1. First write the require context or user input.\n 2. Then write the generated content of the model on this context or user input.\n The style of the separation and the indication of the different parts, should be different in each sample."}], 'model': 'gpt-3.5-turbo', 'n': 1, 'stream': False, 'temperature': 0.8}}
    2024-07-17 16:22:38,739 - DEBUG - connect_tcp.started host='tfkgpt4.openai.azure.com' port=443 local_address=None timeout=None socket_options=None
    2024-07-17 16:22:39,016 - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7fce8edff790>
    2024-07-17 16:22:39,016 - DEBUG - start_tls.started ssl_context=<ssl.SSLContext object at 0x7fce8ee683b0> server_hostname='tfkgpt4.openai.azure.com' timeout=None
    2024-07-17 16:22:39,507 - DEBUG - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7fce8ee06950>
    2024-07-17 16:22:39,507 - DEBUG - send_request_headers.started request=<Request [b'POST']>
    2024-07-17 16:22:39,508 - DEBUG - send_request_headers.complete
    2024-07-17 16:22:39,508 - DEBUG - send_request_body.started request=<Request [b'POST']>
    2024-07-17 16:22:39,508 - DEBUG - send_request_body.complete
    2024-07-17 16:22:39,508 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
    2024-07-17 16:23:24,000 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Cache-Control', b'no-cache, must-revalidate'), (b'Content-Length', b'4546'), (b'Content-Type', b'application/json'), (b'access-control-allow-origin', b'*'), (b'apim-request-id', b'41e55195-f4d5-4bf6-a69a-bc5a636e2473'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains; preload'), (b'x-content-type-options', b'nosniff'), (b'x-ms-region', b'East US 2'), (b'x-ratelimit-remaining-requests', b'9'), (b'x-ratelimit-remaining-tokens', b'9952'), (b'x-accel-buffering', b'no'), (b'x-ms-rai-invoked', b'true'), (b'x-request-id', b'a07ae1ce-fb97-442c-9706-ab8b2160f041'), (b'x-ms-client-request-id', b'41e55195-f4d5-4bf6-a69a-bc5a636e2473'), (b'azureml-model-session', b'd031-20240531174243'), (b'Date', b'Wed, 17 Jul 2024 08:23:23 GMT')])
    2024-07-17 16:23:24,001 - INFO - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
    2024-07-17 16:23:24,001 - DEBUG - receive_response_body.started request=<Request [b'POST']>
    2024-07-17 16:23:24,001 - DEBUG - receive_response_body.complete
    2024-07-17 16:23:24,001 - DEBUG - response_closed.started
    2024-07-17 16:23:24,001 - DEBUG - response_closed.complete
    2024-07-17 16:23:24,001 - DEBUG - HTTP Request: POST https://tfkgpt4.openai.azure.com//openai/deployments/GPT4/chat/completions?api-version=2024-02-01 "200 OK"

Eladlev commented 1 month ago

It seems from the log that you are trying to use GPT-3.5. Can you try to use the default model?

llm:
    name: 'gpt-4-1106-preview'
tfk12 commented 1 month ago

I have changed the model to gpt-4-1106-preview, and the error remains the same. Can you please check whether the prompt files I am using are correct?

    predictor:
        method: 'llm'
        config:
            llm:
                type: 'azure'
                name: '...'
            num_workers: 5
            prompt: 'prompts/predictor_completion/prediction_generation.prompt'
            mini_batch_size: 1  # change to >1 if you want to include multiple samples in the one prompt
            mode: 'prediction'

    meta_prompts:
        folder: 'prompts/meta_prompts_generation'
        num_err_prompt: 1  # Number of error examples per sample in the prompt generation
        num_err_samples: 2  # Number of error examples per sample in the sample generation
        history_length: 4  # Number of samples in the meta-prompt history
        num_generated_samples: 5  # Number of generated samples at each iteration
        num_initialize_samples: 5  # Number of generated samples at iteration 0, in zero-shot case
        samples_generation_batch: 5  # Number of samples generated in one call to the LLM
        num_workers: 5  # Number of parallel workers
        warmup: 4  # Number of warmup steps

tfk12 commented 1 month ago

Perhaps it's related to this setting:

    annotator:
        method: ''

Begging for a working generation example T.T

Eladlev commented 1 month ago

This is fine since there is no annotator in the generation optimization (only the evaluator uses the ranker). I suggest restoring all the default parameters (in all the config files) and verifying that there are no changes. Also, make sure you delete the dump folder, and then try to rerun the example.

tfk12 commented 1 month ago

Well, I restored all the default parameters, used GPT-4, and ran the classification task. I still got the KeyError on 'samples'. So I revised the optimization_pipeline.py script and added some parse functions to extract the 'Sample #' entries from the 'text', and with that the classification example finally ran successfully.
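
Roughly, the fallback I added looks like this (a sketch of the idea rather than my exact patch; the 'text' key is what the unparsed batch output contained in my runs):

    import re

    def extract_samples(batch: dict) -> list:
        """Return the parsed samples; fall back to splitting the raw text on 'Sample #N' headers."""
        if 'samples' in batch:
            return batch['samples']
        text = batch.get('text', '')
        parts = re.split(r"Sample\s*#?\s*\d+\s*[:.)-]?", text)
        return [part.strip() for part in parts[1:] if part.strip()]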

However, the generation example remains unsolved. I now get this error: ValueError: Classification metrics can't handle a mix of unknown and binary targets

Eladlev commented 1 month ago

It seems that the issue was case sensitivity (you used 'azure' while the system expected 'Azure'). This PR should solve the issue: https://github.com/Eladlev/AutoPrompt/pull/76
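
For reference, the gist of a fix like that is to normalize the configured type before dispatching on it, along these lines (an illustration with hypothetical helper names, not the actual PR diff):

    def build_llm(llm_config: dict):
        """Pick the LLM backend while ignoring the case of the configured type ('azure' == 'Azure')."""
        llm_type = str(llm_config.get('type', '')).strip().lower()
        if llm_type == 'azure':
            return build_azure_chat_model(llm_config)   # hypothetical Azure constructor
        if llm_type == 'openai':
            return build_openai_chat_model(llm_config)  # hypothetical OpenAI constructor
        raise ValueError(f"Unknown llm type: {llm_config.get('type')}")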