OpenGenerativeAI / llm-colosseum

Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
https://huggingface.co/spaces/junior-labs/llm-colosseum
MIT License
1.33k stars 159 forks

Larger model, worse performance? #30

Open WSPeng opened 7 months ago

WSPeng commented 7 months ago

Hi, the leaderboard shows that larger models perform worse. Is that because of inference time? Smaller models have a higher action frequency. If so, the benchmark is not very useful.

I think the game could be changed so that it can pause. Then we could compare models without the bias from inference latency.

StanGirard commented 7 months ago

The goal here is to evaluate an LLM in real time. We give the models the ability to plan 3-5 moves ahead of time. Large LLMs can generate more moves, but yes, they take longer.

The goal is to keep that inference latency in the evaluation, but we could add an option to remove it with a parameter for some games.

Please feel free to open a PR to put this in place, but as an option and not by default ;)
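The trade-off discussed above can be illustrated with a toy model. This is only a sketch and not the benchmark's code: the `Agent` class, the `pause_during_inference` flag, `MOVE_SECONDS`, and every number are hypothetical, and it assumes an agent alternates between one LLM call and executing the moves that call returned.

```python
from dataclasses import dataclass

MOVE_SECONDS = 0.5  # assumed in-game time needed to execute one move


@dataclass
class Agent:
    name: str
    latency_s: float     # assumed wall-clock time of one LLM call
    moves_per_call: int  # moves returned by one call


def moves_executed(agent: Agent, match_s: float, pause_during_inference: bool) -> int:
    """Count the moves an agent plays during `match_s` seconds of game time."""
    play_time = agent.moves_per_call * MOVE_SECONDS
    if pause_during_inference:
        # Hypothetical option: the game clock stops while the model thinks,
        # so only the time spent executing moves is charged to the agent.
        cycle = play_time
    else:
        # Real-time mode: thinking time is charged to the agent as well.
        cycle = agent.latency_s + play_time
    calls = int(match_s // cycle)
    return calls * agent.moves_per_call


small = Agent("small", latency_s=0.5, moves_per_call=5)
large = Agent("large", latency_s=4.0, moves_per_call=5)

for paused in (False, True):
    counts = {a.name: moves_executed(a, 60, paused) for a in (small, large)}
    print(f"pause_during_inference={paused}: {counts}")
```

With these made-up numbers the faster agent plays about twice as many moves in real-time mode, while both agents play the same number once the clock is paused during inference. That difference is the latency effect the optional parameter would remove.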

taozhiyuai commented 7 months ago

In my experience, yes. A small model has a high tokens/second rate and is always generating actions, while a big model has to wait for its tokens before it knows how to react. @_@

taozhiyuai commented 7 months ago

The record shows that the small model can generate more actions thanks to its high tokens/second rate.

0.5b wins 3 rounds!

Player 1 using: ollama:qwen:14b-chat-v1.5-fp16
Player 2 using: ollama:qwen:0.5b-chat-v1.5-fp16

Round 1

🏟️ (0647) (0)Starting game
🏟️ (0647) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-03-30 12:20:26.448 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Evaluate Opponent', 'Assess Distance for Effective Attacks']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: super attack 2
Player 2 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: medium kick
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: fireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: fireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 2
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump away
Player 1 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low punch
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: low punch
Player 2 move: low punch
2024-03-30 12:21:41.329 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Mid Punch', 'Mid Punch']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
🏟️ (0647) (0)Round won by P2 (0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megapunch
Player 1 move: hurricane
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!

—————————

Round 2

🏟️ (2b8a) (0)Starting game
🏟️ (2b8a) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump away
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megapunch
Player 1 move: low punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high punch
Player 2 move: high punch
Player 2 move: high punch
Player 2 move: megapunch
Player 2 move: low punch
Player 2 move: low punch
Player 2 move: low kick
🏟️ (2b8a) (0)Round won by P2 (0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high kick
Player 1 move: megapunch
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!

———————

Round 3

🏟️ (b34c) (0)Starting game
🏟️ (b34c) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: medium punch
Player 2 move: super attack 2
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-03-30 12:28:29.109 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Move Closer to get into better attacking range', 'Megafireball or Super attack 2 as a powerful offensive option while closing in']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: megafireball
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: high punch
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: fireball
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megafireball
Player 1 move: medium punch
Player 1 move: fireball
2024-03-30 12:28:58.413 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Assess the distance to the opponent', 'If close', 'If far', 'Move Clo']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megafireball
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: megapunch
Player 2 move: hurricane
Player 2 move: megafireball
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
🏟️ (b34c) (0)Round won by P2 (0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: super attack 3
Player 2 move: low kick
Player 2 move: high kick
Player 2 move: jump closer
Player 2 move: jump away
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!
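To check the "more actions" claim against logs like the ones above, the moves per player can be counted with a short script. This is a minimal sketch, not part of the repository: the `summarize` helper is hypothetical, and it only relies on the `Player N move:` and `Many invalid moves:` strings that appear in the pasted output. The warning lines do not say which player produced them, so they are counted in total only.

```python
import re
from collections import Counter

# Patterns taken from the log lines pasted in this thread.
MOVE_RE = re.compile(r"Player (\d) move: ")
INVALID_RE = re.compile(r"Many invalid moves: \[")


def summarize(log_text: str) -> dict:
    """Count executed moves per player and invalid-move warnings in a log."""
    moves = Counter(f"P{player}" for player in MOVE_RE.findall(log_text))
    invalid_warnings = len(INVALID_RE.findall(log_text))
    return {"moves": dict(moves), "invalid_warnings": invalid_warnings}


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < round1.log
    print(summarize(sys.stdin.read()))
```

Because it uses `findall` on the whole text, the script works whether the log has one entry per line or has been flattened into a single line.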

taozhiyuai commented 7 months ago

[Image: WechatIMG83]

The win rate is 44% after 50 rounds.

@oulianov
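One way to read a 44% win rate over 50 rounds is to put a confidence interval around it. The sketch below rests on assumptions that the thread does not confirm: that 44% of 50 rounds means 22 wins, that rounds are independent, and that a 95% Wilson score interval is an acceptable summary. The thread also does not say which model the rate belongs to.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a win rate of wins / n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed reading of the reported result: 22 wins out of 50 rounds.
low, high = wilson_interval(22, 50)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 0.31 to 0.58, so it still contains 0.5. Fifty rounds alone are not enough to tell the two models apart, and more rounds would be needed for a firmer conclusion.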

oulianov commented 7 months ago

Very interesting results!