clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

running game fails silently on wrong parameters #5

Closed: davidschlangen closed this issue 8 months ago

davidschlangen commented 8 months ago

I'm running the following: python scripts/cli.py -m gpt-3.5-turbo run taboo

I get the following output:

      _                _                     _     
     | |              | |                   | |    
  ___| | ___ _ __ ___ | |__   ___ _ __   ___| |__  
 / __| |/ _ \ '_ ` _ \| '_ \ / _ \ '_ \ / __| '_ \ 
| (__| |  __/ | | | | | |_) |  __/ | | | (__| | | |
 \___|_|\___|_| |_| |_|_.__/ \___|_| |_|\___|_| |_|

No module named 'torch'
Cannot load 'backends.huggingface_local_api'. Please make sure that the file exists.
Loaded backends: anthropic,openai,alephalpha
2023-10-31 13:32:05,221 - benchmark.run - INFO - Run game 1 of 1: taboo
2023-10-31 13:32:05,221 - benchmark.run - INFO - Run experiment 1 of 3: high_en

But there is no records folder to be found anywhere.

davidschlangen commented 8 months ago

It turns out the invocation was wrong: the -m argument should have been a model pair, not a single model. This should not fail silently. The output above gives no indication that the run did not succeed.

Solution: validate command-line parameters when parsing them, and fail with a meaningful error.
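
A minimal sketch of what such parse-time validation could look like, assuming an argparse-based CLI. This is not clembench's actual code: the `PLAYERS_PER_GAME` table, the `solo_demo` entry, the `validate_models` helper, and the repeated `-m` flag syntax are all hypothetical and only illustrate the idea of rejecting a wrong model count before any game starts.

```python
import argparse

# Hypothetical lookup of how many models each game needs.
# Not clembench's real game registry; "solo_demo" is a made-up single-player game.
PLAYERS_PER_GAME = {"taboo": 2, "solo_demo": 1}


def validate_models(game, models):
    """Raise ValueError unless the number of models matches the game's player count."""
    if game not in PLAYERS_PER_GAME:
        raise ValueError(f"Unknown game '{game}'.")
    expected = PLAYERS_PER_GAME[game]
    if len(models) != expected:
        raise ValueError(
            f"Game '{game}' expects {expected} model(s), got {len(models)}: {models}"
        )


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="cli.py")
    # Assumption: one -m flag per model, e.g. "-m model_a -m model_b".
    parser.add_argument("-m", "--models", action="append", required=True)
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("game")
    args = parser.parse_args(argv)
    try:
        validate_models(args.game, args.models)
    except ValueError as err:
        # parser.error prints usage plus the message to stderr and exits with status 2.
        parser.error(str(err))
    return args
```

With this sketch, passing a single model for `taboo` stops the program with a non-zero exit status and an explicit message instead of logging "Run experiment 1 of 3" and then doing nothing.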

phisad commented 8 months ago

The error was only written to clembench.log, not to stdout. As of commit 614215c, the error message is also written to stdout.
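
For illustration, here is one way to send the same records to both a log file and stdout with Python's standard `logging` module. This is a sketch, not the content of the commit: the `configure_logging` helper is hypothetical, and the format string is only assumed from the shape of the log lines in this thread.

```python
import logging
import sys


def configure_logging(log_path="clembench.log", name="benchmark.run"):
    """Return a logger that writes every record to a file and to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    # Assumed format, chosen to resemble the log lines shown in this issue.
    fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

Calling `configure_logging().error("Invalid model pairing ...")` would then produce the line on the console and append it to the file.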

python3 scripts/cli.py -m asdasf run hellogame
2023-11-01 15:22:31,253 - benchmark.run - INFO - Run game 1 of 1: hellogame
2023-11-01 15:22:31,253 - benchmark.run - INFO - Run experiment 1 of 1: greet_en
2023-11-01 15:22:31,253 - benchmark.run - ERROR - Invalid model pairing ['asdasf'] for a multi-player game. For single-player expected only a single model, otherwise a pair.

The program would otherwise behave as before and simply notify the user via the console.

python3 scripts/cli.py -m gpt-4 run taboo 
Loaded backends: anthropic,openai,alephalpha
2023-11-01 15:31:01,716 - benchmark.run - INFO - Run game 1 of 1: taboo
2023-11-01 15:31:01,716 - benchmark.run - INFO - Run experiment 1 of 3: high_en
2023-11-01 15:31:01,716 - benchmark.run - ERROR - Invalid model pairing ['gpt-4'] for a multi-player game. For single-player expected only a single model, otherwise a pair.

The question is whether this is enough for now.

davidschlangen commented 8 months ago

Yes, that should suffice for now.