Short Description
This PR primarily relaxes the system prompt in the job_chat service so that it is less strict about external information. It also adds a notebook to evaluate the online (v1) and candidate (v2) prompts.
Partially fixes #97 and #108.
Implementation Details
To address issue #97, the prompt was edited to allow the assistant to provide information about external platforms and services.
To address issue #108, this PR also adds a notebook that generates a prompt test dataset targeting that issue. The notebook provides an initial case study of how we can track and evaluate the effects of changes to the LLM pipeline more systematically, rather than relying only on qualitative evaluation and spot checks; a sketch of the comparison loop is shown below.
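Illustratively, the core of that comparison loop looks roughly like the following. This is a minimal sketch, not the notebook's actual code: the helper `run_chat`, the prompt file paths, and the CSV names are assumptions for illustration and would need to be wired to the real job_chat pipeline.

```python
import pandas as pd

def run_chat(question: str, system_prompt: str) -> str:
    # Placeholder: in the real notebook this would call the job_chat LLM
    # pipeline with the given system prompt and return the assistant's answer.
    return f"[answer to: {question}]"

# Assumed file names for the question list and the two prompt versions.
questions = pd.read_csv("eval_questions.csv")
v1_prompt = open("prompts/system_v1.txt").read()   # online prompt
v2_prompt = open("prompts/system_v2.txt").read()   # candidate prompt

# Run every question through both prompt versions and collect the answers.
rows = []
for q in questions["question"]:
    rows.append({
        "question": q,
        "answer_v1": run_chat(q, v1_prompt),
        "answer_v2": run_chat(q, v2_prompt),
    })

pd.DataFrame(rows).to_csv("prompt_eval_dataset.csv", index=False)
```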
The small, fully generated evaluation dataset is also included in this PR; it can be used as part of a routine test suite and expanded as we target other issue areas. The dataset contains the LLM outputs for the same set of questions under the online v1 prompt and the candidate v2 prompt. The generated result column indicates whether each answer successfully addressed the question and can be used to compute a success score. On the external-information questions, the new prompt improves the success rate from 20% to 60%.
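For reference, a minimal sketch of how a success score can be computed from such a dataset (the file name and the `result_v1`/`result_v2` boolean columns are assumptions; adjust to the actual schema of the dataset in this PR):

```python
import pandas as pd

# Assumed file and column names; adapt to the dataset added in this PR.
df = pd.read_csv("prompt_eval_dataset.csv")

# The result columns are assumed to be booleans marking whether each answer
# successfully addressed the question under that prompt version.
success_v1 = df["result_v1"].mean()
success_v2 = df["result_v2"].mean()

print(f"v1 (online) success rate:    {success_v1:.0%}")
print(f"v2 (candidate) success rate: {success_v2:.0%}")
```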
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
[x] Code generation (Copilot, but not IntelliSense)
You can read more details in our Responsible AI Policy