chigkim / Ollama-MMLU-Pro

Apache License 2.0

feat: run in parallel #1

Closed sammcj closed 4 months ago

sammcj commented 4 months ago

Add the ability to run tests in parallel

Great tool!
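(For context, a minimal sketch of the idea behind the PR, assuming a per-question worker function; evaluate_question here is an illustrative placeholder, not the PR's actual code:)

from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_question(question):
	# Placeholder for the real blocking Ollama API call;
	# returns one result record per question.
	return {"question": question, "response": "..."}

def run_tests(questions, workers=2):
	# Fan the questions out to a pool of worker threads and
	# collect results as they finish, instead of making one
	# blocking API call at a time.
	with ThreadPoolExecutor(max_workers=workers) as pool:
		futures = [pool.submit(evaluate_question, q) for q in questions]
		return [f.result() for f in as_completed(futures)]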

chigkim commented 4 months ago

What formatter did you use? Most of the changes are formatting, so the actual diff is hard to track. I'm also using tabs for indentation instead of spaces, for accessibility.

chigkim commented 4 months ago

By the way, just continuing the conversation from Reddit: regarding aborting and resuming, I'll test some more, but I think it's an Ollama problem, not the script's. Again, thanks for the PR!

sammcj commented 4 months ago

Oh sorry, I missed the tabs. I’ll update the PR when I’m back home later.

FYI: Python Black w/ VSCode for formatting. It respects .editorconfig, but when there isn’t one I must have had it set to format on save.

sammcj commented 4 months ago

I've implemented locking, which should help. I think it's resuming properly now, but it might benefit from some more testing in anger 😄
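(Roughly this shape, assuming results are accumulated in memory and flushed to a shared JSON file; the lock serialises writes so two workers can't interleave output, which is what keeps the file clean enough to resume from. Illustrative, not the exact diff:)

import json
import threading

results = []
results_lock = threading.Lock()

def save_result(path, record):
	# Only one worker may append and rewrite the file at a time;
	# a partially written results file would break resuming.
	with results_lock:
		results.append(record)
		with open(path, "w") as f:
			json.dump(results, f, indent="\t")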

sammcj commented 4 months ago

I've updated .gitignore to add the eval_results directory, and also added a .editorconfig to make sure no one else gets caught out by the use of tabs and sends you a dodgy PR like I did 🤣
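(For reference, a minimal .editorconfig along those lines; the repo's actual file may differ:)

root = true

[*]
indent_style = tab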

chigkim commented 4 months ago

Merged! Thanks!

chigkim commented 4 months ago

This probably doesn't have anything to do with parallel mode, but I'm asking in case you have some idea. Last night I ran a benchmark with --parallel 2, and every response in eval_results was:

"@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"

The score result was:

{"computer science": {"corr": 43.0, "wrong": 367.0, "acc": 0.1048780487804878}, "total": {"corr": 43.0, "wrong": 367.0, "acc": 0.1048780487804878}}

If the model responded with nothing but @@@, the score should have been 0, yet 43 answers were somehow marked correct, and I couldn't find a single response that wasn't @@@.

>>> import json, re
>>> res = json.load(open("computer science_result.json"))
>>> len(res)
410
>>> reg = r"^@+$"
>>> search = [r['response'] for r in res if re.match(reg, r['response'])]
>>> len(search)
410
>>> search = [r['response'] for r in res if not re.match(reg, r['response'])]
>>> len(search)
0
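(One way to cross-check the score file against the same records, continuing the session above; the pred and answer field names are guesses at the result schema, not confirmed:)

>>> corr = sum(1 for r in res if "pred" in r and r["pred"] == r.get("answer"))
>>> corr, len(res) - corr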

I'm attaching both the response and score files.

I couldn't find anything in the file-saving code that might cause this either. I tried a few more times afterwards, but I couldn't reproduce the result, lol. I'm just wondering if you have any idea? Could it have something to do with Ollama?