Eladlev / AutoPrompt

A framework for prompt tuning using Intent-based Prompt Calibration
Apache License 2.0

Generation with custom data & evaluator #98

Open amitshermann opened 2 weeks ago

amitshermann commented 2 weeks ago

Hi,

We'd like to use AutoPrompt for a generation task where both the input and output are text. We've also developed an evaluator that scores the input-output pairs (e.g., a float between 0 and 1).

Our goal is to optimize the output using our dataset and evaluator, but we're unsure how to set this up with AutoPrompt. Could you provide guidance on how to achieve this?

Thanks in advance,

Eladlev commented 2 weeks ago

Hi, yes, it is relatively simple to tweak the system for this use case. The steps you should follow:

  1. Remove the first step in the optimization (the ranker optimization): lines 40-54 in run_generation_pipeline.py
  2. Prepare a csv with your dataset inputs, following the instructions in this comment; the only difference is that in your case you can also leave the annotation field empty
  3. Put the csv from step 2 at <base_folder>/generator/dataset.csv, and add the flag --load_dump <base_folder>
  4. In this if, add the option custom and set it to: return utils.set_function_from_iterrow(lambda record: custom_score('###User input:\n' + record['text'] + '\n####model prediction:\n' + record['prediction'])), where custom_score is your score function (adapt the format according to the function input); see the sketch after this list
  5. In the config file, change this value to custom, and set the error_threshold to 0.5
  6. Here, change the scale to 0-1
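
For step 4, the added branch would look roughly like this (a simplified sketch; match the elif condition to the surrounding branches in that if, and custom_score is your own evaluator that takes the formatted string and returns a float between 0 and 1):

```python
# Sketch of the 'custom' branch to paste into the existing evaluation-function `if`.
# `utils` is the same module the surrounding branches already use.
elif function_name == 'custom':
    # set_function_from_iterrow wraps a per-row function, so the lambda receives
    # one dataset record with 'text' and 'prediction' columns.
    return utils.set_function_from_iterrow(
        lambda record: custom_score(
            '###User input:\n' + record['text'] +
            '\n####model prediction:\n' + record['prediction']
        )
    )
```

With the dataset from steps 2-3 in place, you then run the pipeline as usual with the dump flag, e.g. python run_generation_pipeline.py --load_dump <base_folder> (plus whatever other flags you normally pass).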

That's all! It should work with all these changes. If there are any issues, I can also help with the integration on the Discord server.

amitshermann commented 2 weeks ago

Thank you. What does the error_threshold mean? Will it make the score Boolean? Because then my custom_eval function kind of loses its meaning. For example, I want the model to understand the difference between a score of 0.8 and a score of 0.6.

Thanks in advance,

Eladlev commented 2 weeks ago

The error_threshold determines the list of examples that is provided to the analyzer (we take the worst from this list). These samples are considered samples that could potentially be improved. You can set a very high threshold here (for example 0.9); if there are too many samples, it simply takes the worst from this list.
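
Roughly, the selection looks like this (a simplified illustration, not the exact code):

```python
import pandas as pd

def select_analyzer_samples(df: pd.DataFrame, error_threshold: float, num_errors: int) -> pd.DataFrame:
    # Every sample scoring below the threshold is a candidate for improvement...
    candidates = df[df['score'] < error_threshold]
    # ...and only the worst-scoring candidates are handed to the analyzer.
    return candidates.sort_values('score').head(num_errors)
```

So the score itself stays continuous; the threshold only decides which samples are eligible, and the difference between 0.6 and 0.8 still determines which of them are surfaced as the worst.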