EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Bug: Unable to calculate metrics from saved predictions using --predict_only and --from_log #337

Open zjwu0522 opened 2 weeks ago

zjwu0522 commented 2 weeks ago

Description:

I'm running into an issue when calculating metrics from predictions that were saved with --predict_only: attempting to compute the metrics afterwards with --from_log fails. It appears that the from_log model is not functioning correctly, possibly due to recent changes in the log format.

Steps to Reproduce:

  1. Run prediction and save outputs:

    I used the following script to generate predictions and save them:

    python3 -m accelerate.commands.launch \
       --num_processes=2 \
       -m lmms_eval \
       --model $model \
       --model_args logs=./logs_vlm/$model/checkpoints/,model_name=$model \
       --tasks $task \
       --batch_size 1 \
       --log_samples \
       --limit 10 \
       --predict_only \
       --output_path "./logs_vlm/$model/$task" \
       --verbosity=DEBUG
  2. Attempt to calculate metrics from saved outputs:

    Then, I tried to calculate metrics using the saved logs:

    python3 -m accelerate.commands.launch \
       --num_processes=2 \
       -m lmms_eval \
       --model from_log \
       --model_args logs=./logs_vlm/$model/$task/,model_name=$model \
       --tasks $task \
       --batch_size 1 \
       --log_samples \
       --limit 10 \
       --output_path "./logs_vlm/$model/$task" \
       --verbosity=DEBUG
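
As a debugging aid, here is a minimal sketch (independent of lmms-eval itself) for inspecting the JSON written in step 1, to check whether the fields the from_log model expects are still present after the suspected log-format change. The output directory, the "*.json" glob, and the "logs" key are assumptions about the on-disk layout rather than the documented schema; adjust them to whatever your run actually produced.

```python
import json
from pathlib import Path

# Hypothetical output directory from step 1; substitute your actual $model / $task.
log_dir = Path("./logs_vlm/<model>/<task>")

# --log_samples typically writes per-task JSON alongside the results file;
# the glob pattern and the "logs" key below are assumptions, not the documented schema.
for log_file in sorted(log_dir.rglob("*.json")):
    with open(log_file) as f:
        data = json.load(f)
    samples = data if isinstance(data, list) else (data.get("logs", []) if isinstance(data, dict) else [])
    if not samples or not isinstance(samples[0], dict):
        continue
    # Print the per-sample keys so they can be compared with what from_log expects.
    print(log_file.name, "->", sorted(samples[0].keys()))
```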

Expected Behavior:

Actual Behavior:

Environment:

Request:

I believe addressing this issue is important for workflows that separate the prediction and evaluation phases. Moreover, fixing this functionality would improve support for offline mode, as discussed in issue #335.

Thank you for your assistance!


kcz358 commented 2 weeks ago

Thank you for raising this issue. I believe an offline workaround without GPT eval can be done by following the workflow mentioned in #335. As for fixing this bug, @pufanyi, may I ask whether you have time to look into it soon? If not, I will look into it and try to fix it later. Thank you!
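
To make the offline route concrete, here is a minimal scoring sketch that reads a saved sample log directly and computes a naive exact-match score, bypassing the from_log model entirely. The key names "target", "filtered_resps", and "resps", and the example file name, are assumptions about the sample-log schema rather than the documented lmms-eval format.

```python
import json

def exact_match_from_log(log_path: str) -> float:
    """Naive exact-match score over logged samples.

    Assumes each sample is a dict holding the reference answer under "target"
    and the model output under "filtered_resps" (or "resps"); these key names
    are guesses and may not match the current lmms-eval log schema.
    """
    with open(log_path) as f:
        data = json.load(f)
    samples = data if isinstance(data, list) else data.get("logs", [])
    correct = 0
    for sample in samples:
        target = str(sample.get("target", "")).strip().lower()
        resp = sample.get("filtered_resps") or sample.get("resps") or ""
        # Responses are often stored as (possibly nested) lists; unwrap them.
        while isinstance(resp, list):
            resp = resp[0] if resp else ""
        correct += int(str(resp).strip().lower() == target)
    return correct / max(len(samples), 1)

# Hypothetical usage (file name is illustrative, not the actual naming convention):
# print(exact_match_from_log("./logs_vlm/<model>/<task>/<task>_samples.json"))
```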