video 83 [Vicuna, OS LLM]: Timestamp for From Vicuna to Human-aligned Evaluation: Comparing Open Source Large Language Models

Timestamp  Description
00:01  Agenda
00:39  Introduction to Data Umbrella
1:04  Code of Conduct
1:24  How to support Data Umbrella
4:58  Introduce the talk and speaker
6:15  Speaker introduces herself and topic
7:48  Background
10:28  Our datasource: ShareGPT
11:24  The Vicuna Project
12:23  Evaluation: GPT-4 as a judge
14:17  Chatbot Arena: Benchmarking LLMs in the wild
16:17  Next steps: better benchmark
17:23  Can we really trust LLM as a judge?
17:43  Overview
21:03  Limitations
23:46  Solutions
24:29  Positive Side: High Agreement with Humans
26:35  Summary
30:36  Human Preference Benchmark and Standardized Benchmark
34:36  Questions
38:14  Organizer wrap up
38:52  Links
