Closed FraserLee closed 1 year ago
Just so that we don't duplicate, I'm nearly done with https://github.com/StampyAI/stampy-chat/issues/6. I should push something within a few days.
@cvarrichio This PR mostly sets up a means to easily benchmark / test new prompts, writing the output to a CSV file. Fraser's looking for feedback on it. If you're able to play around with this new setup, it might speed up your own testing and refinement process as well.
I'm pretty sure this is exactly what I've been working on. Pausing for now.
@cvarrichio Shoot, you're right! I remember that you wanted to work on prompt engineering, and now I remember mentioning that having the testing in place would make that work faster. I see that I even wrote it down in our planning docs, but Fraser wasn't at the meeting so probably didn't catch that. So sorry for the confusion.
FYI, I added some specific things to try for prompt engineering in #3, which should hopefully be much easier to try with these changes.
Ah, sorry for stepping on your toes there, I thought you were working more on #2. If you have any improvements to my version, feel free to open a second PR on top!
Prompt Engineering Workflow
As I had said previously, our prompt is a function - not a template. Certain parameters - what goes in a system vs a user prompt, how we delineate sources, where to break messages, how to handle history, etc - carry too much weight to be cleanly represented with anything but code (a hedged sketch of what such a function can look like follows the steps below). That said, the process should be fairly simple and straightforward even for non-technical people:

1. Adjust `construct_prompt` in `api/chat.py`.
2. Run `cd api`, then `pipenv run python3 prompteng/prompteng.py`.
3. In a Google Sheet, `file->import`. Click `upload`. Drag and drop the file `api/prompteng/data/answers.csv`.
4. `cmd + a`. Click `format->wrapping->wrap`. Adjust column widths to preference.
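For reference, here's a minimal sketch of the kind of function step 1 refers to. It's purely illustrative and assumes an OpenAI-style message list; the names, history truncation, and source delineation below are my assumptions, not the actual contents of `api/chat.py`:

```python
# Hypothetical sketch, NOT the actual construct_prompt in api/chat.py:
# names, truncation, and source delineation are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Block:
    title: str
    text: str

def construct_prompt(query: str, history: list[dict], sources: list[Block]) -> list[dict]:
    """Build an OpenAI-style message list from a query, chat history, and sources."""
    # System message: fixed instructions plus numbered, delineated sources.
    source_text = "\n\n".join(
        f"[{i + 1}] {s.title}\n{s.text}" for i, s in enumerate(sources)
    )
    messages = [{
        "role": "system",
        "content": "Answer using only the numbered sources below.\n\n" + source_text,
    }]
    # Replay a truncated window of history as alternating user/assistant turns.
    messages += history[-10:]
    # The actual question goes last, in its own user message.
    messages.append({"role": "user", "content": query})
    return messages
```

The point is that choices like the history window or the source numbering scheme live in code, which is why a plain template can't capture them yet.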
Possible future improvements
Copying the CSV over manually is burdensome, but my efforts to get the Google Sheets API working didn't pan out. Setting this up backed by our own DB could be nice.
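If we revisit the Sheets route, one option is the third-party gspread library. This is a sketch under the assumption of a service-account credential file and sheet name (both placeholders), not something I've verified against our setup:

```python
# Hedged sketch: push answers.csv into a Google Sheet with gspread.
# Requires a Google Cloud service account; file and sheet names are placeholders.
import csv
import gspread

def upload_answers(csv_path: str = "prompteng/data/answers.csv") -> None:
    gc = gspread.service_account(filename="service_account.json")
    sheet = gc.open("stampy prompt eval").sheet1
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    # Overwrite the sheet contents starting at A1.
    sheet.update(range_name="A1", values=rows)
```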
We could throw GPT-4 in the loop to evaluate answers on some simple criteria (eg "Does the following answer contain any information not in the sources? Write about three sentences of explanation, then on a new line answer either YES or NO").
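Sketching what that loop might look like with the OpenAI Python SDK - model choice, prompt wording beyond the quoted criterion, and verdict parsing are all assumptions:

```python
# Hedged sketch of a GPT-4 grading pass over (answer, sources) pairs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grounded(answer: str, sources: str) -> bool:
    prompt = (
        "Does the following answer contain any information not in the sources? "
        "Write about three sentences of explanation, then on a new line answer "
        f"either YES or NO.\n\nSOURCES:\n{sources}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Take the final line as the verdict; NO = no unsourced info, i.e. grounded.
    verdict = resp.choices[0].message.content.strip().splitlines()[-1]
    return "NO" in verdict.upper()
```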
Although right now the prompt is loose enough that we need to keep it as a function, there might come a point where we nail down the outer structure (eg 1 system message, 1 user message, sources formatted a certain way, final message of X, etc). We're not there yet, but when we reach that point we could store the instruction-text parts in the spreadsheet, where they'd be more easily modified and tracked.
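To make the idea concrete, a hypothetical sketch of what a nailed-down outer structure plus sheet-stored instruction text could look like (all names invented):

```python
# Hedged sketch: once the outer structure is fixed, only the instruction
# strings vary, so each variant could be a row in the spreadsheet.
FIXED_STRUCTURE = [
    ("system", "{instructions}\n\nSOURCES:\n{sources}"),
    ("user", "{question}"),
]

def render(variant: dict[str, str], sources: str, question: str) -> list[dict]:
    """variant holds the editable text pieces, e.g. a row from the sheet."""
    return [
        {"role": role, "content": template.format(
            instructions=variant["instructions"], sources=sources, question=question)}
        for role, template in FIXED_STRUCTURE
    ]
```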
I initially had the output format closer to what was described in #6, with the sources in separate columns to the right of the data, but found that a lot less readable than having them on lines in between the answers. Still not sure whether a better format is possible.
Solving #46 would make the output contain fewer unused sources.
Would love some feedback around this one before I merge!