citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.
https://langcheck.readthedocs.io/en/latest/index.html
MIT License

Implement Simulated Annotators for estimating confidence scores for pairwise comparison #148

Closed conan1024hao closed 2 months ago

conan1024hao commented 2 months ago

Some examples (K = 5, N = 10):

Prompt Why did my parent not invite me to their wedding?
Model A It is possible that your parent did not invite you to their wedding for a variety of reasons. It could be that they did not want to burden you with the planning or expense of the event, or they may not have had the time or resources to include you. Additionally, it could be that they did not feel comfortable having you attend due to your age or relationship with them. It is important to remember that your parent may have had their own reasons for not inviting you, and it is not your place to judge their choices. If you feel neglected or upset about not being invited, it is best to talk to them about it and express your feelings in a non-accusatory way.
Model B It is likely that they wanted to keep the guest list small and intimate. Maybe they felt that inviting you would make the event too crowded, or that they wanted to keep the event reserved for close friends and family. It could be that they wanted to avoid any potential drama or conflict that could arise from having too many people at the event. Whatever the reason, it’s important to remember that they made this decision out of love and respect for you, and that their decision doesn’t define your relationship.
Human Preference Model B
Preference (GPT-3.5 Turbo) Model A
Confidence Score (GPT-3.5 Turbo) 0.446
Preference (GPT-4o) Model A
Confidence Score (GPT-4o) 0.270
Prompt Emoji for "sharing". List 10
Model A 1. 💪 Sharing progress or achievements 2. 📝 Sharing a written message or document 3. 👥 Sharing with friends or a group 4. 📊 Sharing data or statistics 5. 🎤 Sharing a microphone or stage 6. 📸 Sharing a photo 7. 🎥 Sharing a video 8. 🍴 Sharing food or a meal 9. 🚀 Sharing something exciting or important 10. 📚 Sharing knowledge or education.
Model B 1. Sharing hands Share icon Group chat icon Sharing clipboard icon Social media icons Sharing computer icon Sharing network icon Sharing server icon Sharing printer icon * Sharing keyboard icon
Human Preference Model A
Preference (GPT-3.5 Turbo) Model A
Confidence Score (GPT-3.5 Turbo) 0.998
Preference (GPT-4o) Model A
Confidence Score (GPT-4o) 0.839
Prompt ok so i missed doomer. what's the next big thing that will make me rich?
Model A As a language model AI, I cannot predict the future or provide investment advice. However, it is important to do your own research and due diligence before making any investment decisions. It is recommended that you consult with a financial advisor or professional before making any investments. Additionally, it's important to remember that investing always carries risk and it's possible to lose money.
Model B The next big thing that will make you rich is probably Bitcoin. Bitcoin is a cryptocurrency and a form of payment outside the control of any one person, group, or entity, thus removing the need for third-party involvement in financial transactions. This has excited a lot of people and is seeing increased adoption, especially in countries with less developed financial systems. Bitcoin and other cryptocurrencies are still relatively new and volatile, but their potential to revolutionize the way we make and receive payments is unquestioned. So the next big thing may be Bitcoin, or it may be something else entirely. The key is to stay informed and invest wisely.
Human Preference Model B
Preference (GPT-3.5 Turbo) Model B
Confidence Score (GPT-3.5 Turbo) 0.555
Preference (GPT-4o) Model A
Confidence Score (GPT-4o) 0.274
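For readers unfamiliar with the approach in the issue title: the Simulated Annotators idea is to run the LLM judge several times, each time conditioned on a different random sample of human-annotated preference examples, and treat the agreement rate among these simulated annotators as a confidence score. A minimal sketch follows; the `judge` callable, example pool, and majority-vote tie-breaking here are illustrative assumptions, not the actual implementation in this PR:

```python
import random
from collections import Counter

def simulated_annotator_confidence(judge, prompt, answer_a, answer_b,
                                   human_examples, k=5, n=10, seed=0):
    """Estimate a confidence score for a pairwise preference.

    Runs `judge` n times, each time conditioning it on k few-shot
    examples sampled from a pool of human-annotated preferences, and
    returns the majority preference together with the fraction of
    simulated annotators that agreed with it.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n):
        few_shot = rng.sample(human_examples, k)  # one simulated annotator
        votes.append(judge(prompt, answer_a, answer_b, few_shot))
    # Majority vote; agreement rate serves as the confidence score.
    (winner, count), = Counter(votes).most_common(1)
    return winner, count / n
```

With K = 5 and N = 10 as in the examples above, a score near 1.0 means the simulated annotators were nearly unanimous, while a score near 0.5 (for a two-way choice) means the judge is close to guessing.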
yosukehigashi commented 2 months ago

Thanks for drafting this together so quickly!!

Quick question - did you use the OpenAIEvalClient or the AzureOpenAIEvalClient? I'm getting this error with the usual gpt-4o deployment in Azure

OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}
conan1024hao commented 2 months ago

Thanks for drafting this together so quickly!!

Quick question - did you use the OpenAIEvalClient or the AzureOpenAIEvalClient? I'm getting this error with the usual gpt-4o deployment in Azure

OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}

I used OpenAIEvalClient.

conan1024hao commented 2 months ago

@yosukehigashi I fixed a bug in the code; does it look okay now?

I ran the simulated_annotators function 100 times using the two data items below, with n = k = 5, and calculated the mean and standard deviation. The standard deviation looks good, but I think we still need to evaluate the function on datasets like AlpacaEval.

Prompt Why did my parent not invite me to their wedding?
Model A It is possible that your parent did not invite you to their wedding for a variety of reasons. It could be that they did not want to burden you with the planning or expense of the event, or they may not have had the time or resources to include you. Additionally, it could be that they did not feel comfortable having you attend due to your age or relationship with them. It is important to remember that your parent may have had their own reasons for not inviting you, and it is not your place to judge their choices. If you feel neglected or upset about not being invited, it is best to talk to them about it and express your feelings in a non-accusatory way.
Model B It is likely that they wanted to keep the guest list small and intimate. Maybe they felt that inviting you would make the event too crowded, or that they wanted to keep the event reserved for close friends and family. It could be that they wanted to avoid any potential drama or conflict that could arise from having too many people at the event. Whatever the reason, it’s important to remember that they made this decision out of love and respect for you, and that their decision doesn’t define your relationship.
Mean Score 0.597
Standard Deviation 0.093
Prompt What is the difference between OpenCL and CUDA?
Model A OpenCL and CUDA are two different programming models that are used for parallel computing. OpenCL is a general-purpose parallel programming interface that allows developers to write parallel code that can run on any platform that supports OpenCL, which includes most modern operating systems and computer systems, including Windows, Linux, and macOS. It provides a lower-level, more flexible API that is more suitable for building large-scale distributed computing systems. CUDA is a specific implementation of OpenCL that is designed for performance and scalability in devices with multiple GPU(s). It was developed by Nvidia and is widely used for scientific computing, machine learning, and graphics and image processing. The CUDA API provides a high-level, more efficient way to write parallel code that can run on top of a GPU, while still maintaining the ability to write code that can be easily adapted for other parallel computing platforms.
Model B OpenCL and CUDA are both programming languages for parallel computing on GPUs, but they differ in several key ways:\n\n1. Architecture: OpenCL is a general-purpose parallel computing language, while CUDA is designed specifically for GPU computing. OpenCL can run on various types of processors, including CPUs, GPUs, and FPGAs, whereas CUDA is only compatible with NVIDIA GPUs.\n2. Language: OpenCL is a C++ language, while CUDA is a C++ language with additional features for GPU computing. CUDA provides additional libraries and tools for handling graphics and other tasks unique to GPUs.\n3. Driver support: OpenCL has a wider range of device support, with support for not just NVIDIA GPUs but also other manufacturers like AMD, Intel, and ARM. CUDA is only available on NVIDIA GPUs, and requires the CUDA Toolkit to be installed on the host machine.\n4. Memory management: OpenCL provides a unified memory model, where all the data is stored in global memory, and the kernel can access any part of it. CUDA, on the other hand, uses a hierarchical memory architecture, where data is divided into host, device, and shared memory.\n5. Performance: OpenCL and CUDA have different strengths and weaknesses in terms of performance. OpenCL is generally faster for numerical computations, while CUDA is faster for graphics and other tasks that require fast memory access. CUDA also has better support for 64-bit floating-point operations, which are essential for some scientific applications.....
Mean Score 0.661
Standard Deviation 0.116
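The repeated-run stability check described above (100 runs of `simulated_annotators`, then mean and standard deviation) can be sketched with the standard library; the `score_fn` wrapper is a hypothetical stand-in for one call to the PR's scoring function:

```python
import statistics

def score_stability(score_fn, runs=100):
    """Call a stochastic confidence-scoring function repeatedly and
    report the mean and sample standard deviation of its scores."""
    scores = [score_fn() for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```

A low standard deviation across runs is evidence that the confidence estimate is stable rather than an artifact of one particular sample of few-shot examples.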
yosukehigashi commented 2 months ago

Wow this is awesome!! The standard deviation looks much better now

I think we still need to evaluate the function on datasets like AlpacaEval.

Is AlpacaEval higher quality than Chatbot Arena?

yosukehigashi commented 2 months ago

Quick question - did you use the OpenAIEvalClient or the AzureOpenAIEvalClient? I'm getting this error with the usual gpt-4o deployment in Azure

OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}

I used OpenAIEvalClient.

Just found that logprobs is supported in Azure now. We just need to use the 2024-06-01 version or above https://learn.microsoft.com/en-us/azure/ai-services/openai/whats-new#july-2024
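To avoid the 400 error above, the request just needs to go through an API version that supports `logprobs` (2024-06-01 or later). A minimal sketch, with the client construction left as comments since it needs live Azure credentials (the endpoint, key, and deployment name are placeholders):

```python
def build_logprob_request(deployment, prompt, top_logprobs=5):
    """Build chat-completion kwargs that request token log probabilities.

    Assumes an Azure OpenAI resource on api_version 2024-06-01 or later;
    earlier versions reject the 'logprobs' parameter with a 400 error.
    """
    return {
        "model": deployment,  # the Azure deployment name, e.g. "gpt-4o"
        "messages": [{"role": "user", "content": prompt}],
        "logprobs": True,
        "top_logprobs": top_logprobs,
    }

# from openai import AzureOpenAI
# client = AzureOpenAI(
#     azure_endpoint="https://<resource>.openai.azure.com",
#     api_key="<key>",
#     api_version="2024-06-01",  # logprobs requires 2024-06-01 or later
# )
# response = client.chat.completions.create(
#     **build_logprob_request("gpt-4o", "Which answer is better, A or B?"))
```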

conan1024hao commented 2 months ago

Is AlpacaEval higher quality than Chatbot Arena?

I guess the answer is no (AlpacaEval is GPT-annotated)

conan1024hao commented 2 months ago

I guess I completely misunderstood the confidence score equation... ChatGPT is right. (screenshot of the corrected equation omitted)

conan1024hao commented 2 months ago

@yosukehigashi I fixed the confidence score calculation, and the standard deviation for the wedding example is now around 0.05! I hope this is correct now.

Regarding the handling of enforce_consistency and calculated_confidence: since the calculated confidence score is either A's score or B's score, we cannot directly use it to explain the original model's output or the enforce_consistency result. It might be better to override the original output when the calculated_confidence option is turned on. What do you think? Alternatively, maybe we can show both A's and B's scores when enforce_consistency results in None and say "look, the model's confidence scores for both sides are low!"

conan1024hao commented 2 months ago

@yosukehigashi Thanks for the small fixes! They look good to me.

Here is the process for making processed_chatarena_examples.jsonl. I extracted 1000 samples from the chatbot arena data that meet the following conditions: