EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Wandb logger can't handle groups with heterogeneous metrics #1958

Open dmitrii-palisaderesearch opened 3 months ago

dmitrii-palisaderesearch commented 3 months ago

Hi,

The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in WandbLogger.log_eval_samples or in the openllm group (maybe group tasks ought to be homogeneous by design).

# use any model to reproduce
lm-eval --tasks openllm \
        --wandb_args entity=XXX,project=XXX
Traceback
TypeError                                 Traceback (most recent call last)

/usr/local/lib/python3.10/site-packages/lm_eval/logging_utils.py in log_eval_samples(self, samples)
    395                 self._log_samples_as_artifact(eval_preds, task_name)
    396 
--> 397             self.run.log({f"{group}_eval_results": grouped_df})
    398 
    399 

12 frames

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
    418                     return cls.Dummy()
    419 
--> 420             return func(self, *args, **kwargs)
    421 
    422         return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper_fn(self, *args, **kwargs)
    369             def wrapper_fn(self: Type["Run"], *args: Any, **kwargs: Any) -> Any:
    370                 if not getattr(self, "_is_finished", False):
--> 371                     return func(self, *args, **kwargs)
    372 
    373                 default_message = (

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
    359                     raise e
    360                 cls._is_attaching = ""
--> 361             return func(self, *args, **kwargs)
    362 
    363         return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in log(self, data, step, commit, sync)
   1836                 repeat=False,
   1837             )
-> 1838         self._log(data=data, step=step, commit=commit)
   1839 
   1840     @_run_decorator._noop_on_finish()

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _log(self, data, step, commit)
   1600             raise ValueError("Key values passed to `wandb.log` must be strings.")
   1601 
-> 1602         self._partial_history_callback(data, step, commit)
   1603 
   1604         if step is not None:

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _partial_history_callback(self, row, step, commit)
   1472             not_using_tensorboard = len(wandb.patched["tensorboard"]) == 0
   1473 
-> 1474             self._backend.interface.publish_partial_history(
   1475                 row,
   1476                 user_step=self._step,

/usr/local/lib/python3.10/site-packages/wandb/sdk/interface/interface.py in publish_partial_history(self, data, user_step, step, flush, publish_step, run)
    570         run = run or self._run
    571 
--> 572         data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
    573         data.pop("_step", None)
    574 

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in history_dict_to_json(run, payload, step, ignore_copy_err)
     50             )
     51         else:
---> 52             payload[key] = val_to_json(
     53                 run, key, val, namespace=step, ignore_copy_err=ignore_copy_err
     54             )

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in val_to_json(run, key, val, namespace, ignore_copy_err)
     81 
     82     if util.is_pandas_data_frame(val):
---> 83         val = wandb.Table(dataframe=val)
     84 
     85     elif util.is_matplotlib_typename(typename) or util.is_plotly_typename(typename):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in __init__(self, columns, data, rows, dataframe, dtype, optional, allow_mixed_types)
    207         # Explicit dataframe option
    208         if dataframe is not None:
--> 209             self._init_from_dataframe(dataframe, columns, optional, dtype)
    210         else:
    211             # Expected pattern

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _init_from_dataframe(self, dataframe, columns, optional, dtype)
    264         self._make_column_types(dtype, optional)
    265         for row in range(len(dataframe)):
--> 266             self.add_data(*tuple(dataframe[col].values[row] for col in self.columns))
    267 
    268     def _make_column_types(self, dtype=None, optional=True):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in add_data(self, *data)
    408 
    409         # Update the table's column types
--> 410         result_type = self._get_updated_result_type(data)
    411         self._column_types = result_type
    412 

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _get_updated_result_type(self, row)
    432         result_type = current_type.assign(incoming_row_dict)
    433         if isinstance(result_type, _dtypes.InvalidType):
--> 434             raise TypeError(
    435                 "Data row contained incompatible types:\n{}".format(
    436                     current_type.explain(incoming_row_dict)
TypeError: Data row contained incompatible types:
{'id': 0, 'data': "Question: Jen and Tyler are gymnasts practicing flips. Jen is practicing the triple-flip while Tyler is practicing the double-flip. Jen did sixteen triple-flips during practice. Tyler flipped in the air half the number of times Jen did. How many double-flips did Tyler do?\nAnswer: Jen did 16 triple-flips, so she did 16 * 3 = <<16*3=48>>48 flips.\nTyler did half the number of flips, so he did 48 / 2 = <<48/2=24>>24 flips.\nA double flip has two flips, so Tyler did 24 / 2 = <<24/2=12>>12 double-flips.\n#### 12\n\nQuestion: Four people in a law firm are planning a party. Mary will buy a platter of pasta for $20 and a loaf of bread for $2. Elle and Andrea will split the cost for buying 4 cans of soda which cost $1.50 each, and chicken wings for $10. Joe will buy a cake that costs $5. How much more will Mary spend than the rest of the firm put together?\nAnswer: Mary will spend $20 + $2 = $<<20+2=22>>22.\nElle and Andrea will spend $1.5 x 4 = $<<1.5*4=6>>6 for the soda.\nElle and Andrea will spend $6 + $10 = $<<6+10=16>>16 for the soda and chicken wings.\nElle, Andrea, and Joe together will spend $16 + $5 = $<<16+5=21>>21.\nSo, Mary will spend $22 - $21 = $<<22-21=1>>1 more than all of them combined.\n#### 1\n\nQuestion: A charcoal grill burns fifteen coals to ash every twenty minutes of grilling. The grill ran for long enough to burn three bags of coals. Each bag of coal contains 60 coals. How long did the grill run?\nAnswer: The grill burned 3 * 60 = <<3*60...
Key 'labels':
    String not assignable to None or Number
        String not assignable to None
    and
        String not assignable to Number
Key 'raw_predictions':
    String not assignable to None or Number
        String not assignable to None
    and
        String not assignable to Number
Key 'filtered_predictions':
    String not assignable to None or Number
        String not assignable to None
    and
        String not assignable to Number

WandbLogger.log_eval_samples concatenates task outputs into one big dataframe without reconciling column types, and wandb balks at this.
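Here's a minimal sketch of the failure mode outside the harness (the project name and column values are made up for illustration):

import pandas as pd
import wandb

# One task's samples carry numeric labels, another's carry strings;
# concatenating them yields a single 'labels' column with mixed types.
numeric_task = pd.DataFrame({"id": [0], "labels": [1]})
string_task = pd.DataFrame({"id": [1], "labels": ["#### 12"]})
grouped_df = pd.concat([numeric_task, string_task], ignore_index=True)

run = wandb.init(project="repro")  # hypothetical project name
# wandb converts the dataframe to a wandb.Table, infers the 'labels'
# column type from the first row (Number), then raises
# "TypeError: Data row contained incompatible types" on the String row.
run.log({"openllm_eval_results": grouped_df})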

haileyschoelkopf commented 3 months ago

@lintangsutawika will #1741 fix this, do you think?

We're working on making groups clearer--namely, making a distinction between homogeneous groups, which will report their aggregated scores on a given metric, and heterogeneous "groups" (which become tags), which are just convenience names used to invoke a number of related tasks at once.
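As a rough sketch of that direction in config terms (key names follow the refactor's intent and may not match the final schema exactly):

# group config (its own yaml): a homogeneous group whose subtasks
# share metrics, so an aggregate score is meaningful
group: mmlu
task:
  - mmlu_abstract_algebra
  - mmlu_anatomy

# in each member task's yaml: a heterogeneous "group" becomes a tag,
# i.e. just a convenience name for invoking related tasks together
tag:
  - openllm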

lintangsutawika commented 3 months ago

I'm not yet able to reproduce this. It seems to work fine with the latest version in main.

lm-eval \
    --model_args "pretrained=gpt2" \
    --tasks openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx

haileyschoelkopf commented 3 months ago

I think --log_samples may be required to reproduce?

dmitrii-palisaderesearch commented 3 months ago

Yes, it is. Sorry, missed it in the repro.


lintangsutawika commented 3 months ago

Still seems to work from main. @dmitrii-palisaderesearch are you using the latest main?

lm-eval \
    --model_args "pretrained=gpt2" \
    --tasks openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx \
    --log_samples --output_path test_wandb/

dmitrii-palisaderesearch commented 3 months ago

Hey, here are the repro colab notebooks:

The lm-eval CLI eats the exception and hides it under an INFO log entry, so it's a little hard to see. I added a direct lm_eval library call as well so you can see the exact line that throws, along with a traceback.

BTW thanks for your repro cmd, it's extremely helpful :)

lintangsutawika commented 3 months ago

Thanks, I see what the issue is now. It's a matter of not being able to reconcile different data types that land in the same column, which can happen when calling a set of tasks whose outputs have different types. I think the solution here is to not concatenate different tasks together. Unless this is actually desirable?
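One possible shape for that, sketched with hypothetical names rather than the actual WandbLogger internals:

import pandas as pd

def log_samples_per_task(run, samples_by_task: dict[str, pd.DataFrame]) -> None:
    # Log one table per task instead of a single concatenated group table,
    # so each table's columns stay type-homogeneous. `run` is a live
    # wandb run; function and argument names are invented for illustration.
    for task_name, task_df in samples_by_task.items():
        run.log({f"{task_name}_eval_results": task_df})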

dmitrii-palisaderesearch commented 3 months ago

So it would be convenient to get samples from a homogeneous group like mmlu in one table.

OTOH, I don't feel like getting samples from "open llm leaderboard" will be useful: aggregate metrics suffice there.

This is probably what #1741 will do.

haileyschoelkopf commented 3 months ago

For the specific MMLU case, in order to still support logging the full sample list, perhaps we can have a flag in group configs that keeps all subtask samples together for logging, and otherwise not log groups' samples? @lintangsutawika do you think this seems too contrived?
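For illustration, the flag might look something like this in a group config (log_samples_combined is a name invented here, not an existing option):

group: mmlu
task:
  - mmlu_abstract_algebra
  - mmlu_anatomy
# hypothetical flag: keep subtask samples in one combined table when
# logging, which is safe here because all mmlu subtasks share a format
log_samples_combined: true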

lintangsutawika commented 3 months ago

For groups like MMLU, the wandb issue shouldn't occur since all subtasks share the same format.

On grouping samples, the bigger issue may be how we log the results.json and samples.json files. I guess this means it's not a quick fix but one that should suit long-term usability.

Btw @dmitrii-palisaderesearch, if you want the samples from mmlu tasks, would running just mmlu suffice?

dmitrii-palisaderesearch commented 3 months ago

Sure, this works great. I just wanted to assemble my benchmark into one big yaml config and hit this.