conan1024hao closed this 2 months ago
Thanks for drafting this together so quickly!!
Quick question - did you use the `OpenAIEvalClient` or the `AzureOpenAIEvalClient`? I'm getting this error with the usual gpt-4o deployment in Azure:
```
OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}
```
> Thanks for drafting this together so quickly!!
>
> Quick question - did you use the `OpenAIEvalClient` or the `AzureOpenAIEvalClient`? I'm getting this error with the usual gpt-4o deployment in Azure:
>
> `OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}`
I used `OpenAIEvalClient`.
@yosukehigashi I fixed a bug in the code - does it look okay now?
I ran the `simulated_annotators` function 100 times on the two data items below, with n = k = 5, and calculated the mean and standard deviation of the scores. The standard deviation looks good, but I think we still need to evaluate the function on datasets like AlpacaEval.
| Prompt | Why did my parent not invite me to their wedding? |
|---|---|
| Model A | It is possible that your parent did not invite you to their wedding for a variety of reasons. It could be that they did not want to burden you with the planning or expense of the event, or they may not have had the time or resources to include you. Additionally, it could be that they did not feel comfortable having you attend due to your age or relationship with them. It is important to remember that your parent may have had their own reasons for not inviting you, and it is not your place to judge their choices. If you feel neglected or upset about not being invited, it is best to talk to them about it and express your feelings in a non-accusatory way. |
| Model B | It is likely that they wanted to keep the guest list small and intimate. Maybe they felt that inviting you would make the event too crowded, or that they wanted to keep the event reserved for close friends and family. It could be that they wanted to avoid any potential drama or conflict that could arise from having too many people at the event. Whatever the reason, it's important to remember that they made this decision out of love and respect for you, and that their decision doesn't define your relationship. |
| Mean Score | 0.597 |
| Standard Deviation | 0.093 |
| Prompt | What is the difference between OpenCL and CUDA? |
|---|---|
| Model A | OpenCL and CUDA are two different programming models that are used for parallel computing.OpenCL is a general-purpose并行编程接口 that allows developers to write parallel code that can run on any platform that supportsCL, which includes most modern operating systems and computer systems, including Windows, Linux, and macOS. It provides a lower-level, more flexible API that is more suitable for building large-scale distributed computing systems.CUDA is a specific implementation ofOpenCL that is designed for performance and scalability in devices with multiple GPU(s). It was developed by Nvidia and is widely used for scientific computing, machine learning, and图形图像处理. The CUDA API provides a high-level, more efficient way to write parallel code that can run on top of a GPU, while still maintaining the ability to write code that can be easily adapted for other parallel computing platforms. |
| Model B | OpenCL and CUDA are both programming languages for parallel computing on GPUs, but they differ in several key ways:\n\n1. Architecture: OpenCL is a general-purpose parallel computing language, while CUDA is designed specifically for GPU computing. OpenCL can run on various types of processors, including CPUs, GPUs, and FPGAs, whereas CUDA is only compatible with NVIDIA GPUs.\n2. Language: OpenCL is a C++ language, while CUDA is a C++ language with additional features for GPU computing. CUDA provides additional libraries and tools for handling graphics and other tasks unique to GPUs.\n3. Driver support: OpenCL has a wider range of device support, with support for not just NVIDIA GPUs but also other manufacturers like AMD, Intel, and ARM. CUDA is only available on NVIDIA GPUs, and requires the CUDA Toolkit to be installed on the host machine.\n4. Memory management: OpenCL provides a unified memory model, where all the data is stored in global memory, and the kernel can access any part of it. CUDA, on the other hand, uses a hierarchical memory architecture, where data is divided into host, device, and shared memory.\n5. Performance: OpenCL and CUDA have different strengths and weaknesses in terms of performance. OpenCL is generally faster for numerical computations, while CUDA is faster for graphics and other tasks that require fast memory access. CUDA also has better support for 64-bit floating-point operations, which are essential for some scientific applications..... |
| Mean Score | 0.661 |
| Standard Deviation | 0.116 |
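For context, here's roughly the harness I used to get these numbers (a minimal sketch; the `(prompt, response_a, response_b, n, k)` signature for `simulated_annotators` is an assumption for illustration, not the actual one):

```python
import statistics

def repeat_eval(score_fn, prompt, response_a, response_b, n_runs=100, n=5, k=5):
    """Run a scoring function repeatedly and report mean / standard deviation.

    score_fn is meant to be this repo's simulated_annotators; the argument
    list assumed here is for illustration only.
    """
    scores = [
        score_fn(prompt, response_a, response_b, n=n, k=k)
        for _ in range(n_runs)
    ]
    return statistics.mean(scores), statistics.stdev(scores)

# Usage (names are placeholders for the wedding example above):
# mean, sd = repeat_eval(simulated_annotators, wedding_prompt, answer_a, answer_b)
# print(f"Mean Score: {mean:.3f}, Standard Deviation: {sd:.3f}")
```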
Wow this is awesome!! The standard deviation looks much better now
> I think we still need to evaluate the function on datasets like AlpacaEval.
Is AlpacaEval higher quality than Chatbot Arena?
> Quick question - did you use the `OpenAIEvalClient` or the `AzureOpenAIEvalClient`? I'm getting this error with the usual gpt-4o deployment in Azure:
>
> `OpenAI failed to return an assessment corresponding to 0th prompt: Error code: 400 - {'error': {'message': "This model does not support the 'logprobs' parameter.", 'type': 'invalid_request_error', 'param': 'logprobs', 'code': None}}`
>
> I used `OpenAIEvalClient`.
Just found that `logprobs` is supported in Azure now. We just need to use API version `2024-06-01` or above:
https://learn.microsoft.com/en-us/azure/ai-services/openai/whats-new#july-2024
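For anyone hitting the 400 above: with the raw `openai` SDK the fix looks something like this (a sketch; the env var names and deployment name are placeholders, and I haven't checked how `AzureOpenAIEvalClient` exposes the API version):

```python
import os
from openai import AzureOpenAI

# API version 2024-06-01 or later is required for logprobs on Azure.
client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure deployment name
    messages=[{"role": "user", "content": "Which response is better, A or B?"}],
    logprobs=True,
    top_logprobs=5,
)
# Top token alternatives for the first output token, with their logprobs.
print(response.choices[0].logprobs.content[0].top_logprobs)
```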
> Is AlpacaEval higher quality than Chatbot Arena?
I guess the answer is no (AlpacaEval is GPT-annotated)
I guess I completely misunderstood the confidence score equation... ChatGPT is right.
@yosukehigashi I fixed the confidence score calculation, and now the standard deviation for the wedding example is down to around 0.05! I hope this is correct now.
Regarding the handling of `enforce_consistency` and `calculated_confidence`: since the calculated confidence score is either A's score or B's score, we cannot directly use it to explain the original model's output or the `enforce_consistency` result. It might be better to override the original output when the `calculated_confidence` option is turned on. What do you think? Alternatively, maybe we can show both A's and B's scores when `enforce_consistency` results in `None` and say "look, the model's confidence scores for both sides are low!"
@yosukehigashi Thanks for the small fixes! They look good to me.
Here is the process for making `processed_chatarena_examples.jsonl`.
I extracted 1000 samples from the Chatbot Arena data that meet the following conditions (see the sketch after this list):

- The `winner` column is not "Tie".
- The sample is not included in `pairwise_comparison_400.json` (the pairwise benchmark dataset we are currently using).
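Roughly, the extraction looked like this (a sketch; the source file name and the `prompt` field used for deduplication are assumptions about the schema):

```python
import json
import random

# Prompts already used in the current benchmark, to avoid overlap.
with open("pairwise_comparison_400.json") as f:
    benchmark_prompts = {item["prompt"] for item in json.load(f)}

candidates = []
with open("chatbot_arena_conversations.jsonl") as f:  # assumed source file name
    for line in f:
        row = json.loads(line)
        # Condition 1: skip ties.
        if row["winner"].lower().startswith("tie"):
            continue
        # Condition 2: skip samples already in pairwise_comparison_400.json.
        if row["prompt"] in benchmark_prompts:
            continue
        candidates.append(row)

# Sample 1000 of the remaining rows and write them out.
random.seed(0)
with open("processed_chatarena_examples.jsonl", "w") as f:
    for row in random.sample(candidates, 1000):
        f.write(json.dumps(row) + "\n")
```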
Some examples (K = 5, N = 10):