Regarding the issue with evaluate.py

xlxwalex commented 2 months ago

Hello, I have two questions regarding evaluate.py:

1. I noticed that investigate.py calls evaluate.py when gathering data at the end, which results in the following error:
```
  File "/holmes_evaluation/src/evaluate.py", line 24, in main
    probing_dataset, model_name, encoding, control_task_type, sample_size, seed, num_hidden_layers, _ = (
ValueError: too many values to unpack (expected 8)
```
This seems to be caused by an extra hyphen in the replace call on Line 26. Changing it to the following should export the results to a CSV table correctly:
```
.replace(f"{result_folder}/{version}", "")
```
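For anyone hitting the same error, here is a minimal sketch of how a prefix that does not match can produce it. It is not the actual code in evaluate.py: the helper `parse_result_path`, the directory layout, and the example path are assumptions made for illustration. If the string passed to replace carries an extra hyphen, nothing is stripped, the split yields more than 8 fields, and the unpacking fails.

```python
def parse_result_path(path: str, prefix: str) -> tuple:
    """Strip `prefix` from `path` and unpack the remaining 8 path fields.

    Hypothetical helper: the real layout used by evaluate.py may differ.
    """
    parts = path.replace(prefix, "").strip("/").split("/")
    (probing_dataset, model_name, encoding, control_task_type,
     sample_size, seed, num_hidden_layers, _) = parts
    return (probing_dataset, model_name, encoding, control_task_type,
            sample_size, seed, num_hidden_layers)


# Assumed example values, not taken from the repository.
result_folder, version = "results", "v1"
path = "results/v1/ds/model/enc/ctrl/100/0/12/done.yaml"

# Matching prefix: exactly 8 fields remain, so the unpacking succeeds.
print(parse_result_path(path, f"{result_folder}/{version}"))

# Prefix with an extra hyphen: replace() finds nothing to strip,
# so 10 fields remain and the unpacking raises ValueError.
try:
    parse_result_path(path, f"{result_folder}/{version}-")
except ValueError as err:
    print("ValueError:", err)
```
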
2. In the Leaderboard, each model has a score for each task category, but evaluate.py does not appear to support this. Could you please advise how I can obtain a model's scores for Overall, Discourse, Morphology, Reasoning, Semantics, and Syntax?

I am eagerly awaiting your reply.

holmesbenchmark commented 1 month ago

Hi,

The evaluation should work now. Please make sure to re-download the data and clean the dumps and results directories. It would be great if you could confirm that it also works on your side.
We are currently working on integrating the full evaluation pipeline into this repository, including the Explorer and Leaderboard. In the meantime, you can send us the results at holmesbenchmark@gmail.com and we can provide you with the numbers.

Regarding 2), we will keep this issue open and update you as soon as we have finished the integration.

xlxwalex commented 1 month ago

Thank you for the prompt reply. The new code can correctly output the result CSV file.

Regarding the second point, I have a question: I noticed that Line 16 of evaluate.py selects only the first 10 result files (result_files[:10]). If I need to obtain the results for each category by email, would it be sufficient to remove the [:10], export the result CSV, and send it to you?
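To make the effect of the slice concrete, here is a small sketch. The function `export_rows`, its `limit` parameter, and the file names are hypothetical and not taken from evaluate.py. With the slice in place only 10 rows reach the CSV; without it every result file is exported.

```python
import csv
import io


def export_rows(result_files, limit=None):
    """Write one CSV row per result file; `limit` mimics the [:10] slice."""
    selected = result_files[:limit]  # result_files[:None] keeps every entry
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["result_file"])  # header row
    for name in selected:
        writer.writerow([name])
    return buffer.getvalue()


# Hypothetical file names, for illustration only.
result_files = [f"task_{i}/done.yaml" for i in range(25)]

truncated = export_rows(result_files, limit=10)
complete = export_rows(result_files)

# Number of data rows, excluding the header.
print(len(truncated.splitlines()) - 1)
print(len(complete.splitlines()) - 1)
```
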

holmesbenchmark commented 1 month ago

Good to hear!

We removed the [:10] you mentioned. Yes, just send us the CSV file; we can evaluate it on our side in the next few days.

xlxwalex commented 1 month ago

Thank you very much. After I finish testing, I will send the results to you.

Holmes-Benchmark / holmes-evaluation

Regarding the issue with evaluate.py #2