Spenhouet / tensorboard-aggregator

Aggregate multiple tensorboard runs to new summary or csv files
MIT License

No scalars found in event files #6

Closed pirobot closed 3 years ago

pirobot commented 4 years ago

events.out.tfevents.1594678042.pi-dell.10209.12.v2.zip

Describe the bug: When running aggregator.py against our event files, we get the following error:

  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 155, in <module>
    aggregate(path, args.output, args.subpaths)
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 120, in aggregate
    extracts_per_subpath = {subpath: extract(dpath, subpath) for subpath in subpaths}
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 120, in <dictcomp>
    extracts_per_subpath = {subpath: extract(dpath, subpath) for subpath in subpaths}
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 36, in extract
    assert len(set(all_keys)) == 1, "All runs need to have the same scalar keys. There are mismatches in {}".format(all_keys)
AssertionError: All runs need to have the same scalar keys. There are mismatches in []

To Reproduce: Run aggregator.py against the attached event file.

Expected behavior: Summary files are generated from the scalars.

Screenshots: None.

Additional context: This is the output we get when we run tensorboard --inspect on the same event file:

tensorboard --inspect --event_file events.out.tfevents.1594678042.pi-dell.10209.12.v2
======================================================================
Processing event files... (this can take a few minutes)
======================================================================

These tags are in events.out.tfevents.1594678042.pi-dell.10209.12.v2:
audio -
histograms -
images -
scalars -
tensor
   Metrics/AverageEpisodeLength
   Metrics/AverageReturn
   Metrics/average_distance_to_nearest_neighbor
   Metrics_vs_EnvironmentSteps/AverageEpisodeLength
   Metrics_vs_EnvironmentSteps/AverageReturn
   Metrics_vs_NumberOfEpisodes/AverageEpisodeLength
   Metrics_vs_NumberOfEpisodes/AverageReturn
======================================================================

Event statistics for events.out.tfevents.1594678042.pi-dell.10209.12.v2:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor
   first_step           0
   last_step            100
   max_step             1100
   min_step             0
   num_steps            4
   outoforder_steps     [(1100, 85), (1100, 100)]
======================================================================
Spenhouet commented 4 years ago

Hi @pirobot,

As you correctly identified, and as the error message states, no scalars were found. Your inspection output shows that the event file contains no scalars, only tensors. The tensorboard-aggregator only works with scalars, not with tensors.

I hope this helps.
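
The empty list at the end of the assertion message is the telling part: no run contributed any scalar keys, so the set of key tuples has size 0 rather than 1. Below is a simplified stand-in for that check, assuming runs without scalar keys are skipped before the comparison. check_scalar_keys is a hypothetical name for illustration, not the function in aggregator.py.

```python
from typing import Dict, List, Tuple


def check_scalar_keys(runs: Dict[str, List[str]]) -> Tuple[str, ...]:
    """Return the scalar keys shared by all runs, or fail if they differ or are absent."""
    # Assumption for this sketch: runs without any scalar tags are dropped first.
    all_keys = [tuple(keys) for keys in runs.values() if keys]
    assert len(set(all_keys)) == 1, (
        "All runs need to have the same scalar keys. "
        "There are mismatches in {}".format(all_keys)
    )
    return all_keys[0]


# A run whose event file holds only tensor summaries exposes no scalar keys,
# so the check fails with an empty list in the message.
try:
    check_scalar_keys({"run_0": []})
except AssertionError as err:
    message = str(err)
```

So the message does not point to a mismatch between runs here. It means that nothing readable as a scalar was found at all.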

pirobot commented 4 years ago

Hi @Spenhouet, many thanks for the quick response! Forgive the newbie question, but we appear to be writing our log data as scalars, so I'm not sure why they show up as tensors. Here is the code we use to write the data. Can you see anything obvious that we are doing incorrectly?

if global_step_val % eval_interval == 0:
    metric_utils.compute_summaries(
        eval_metrics,
        eval_py_env,
        eval_py_policy,
        num_episodes=num_eval_episodes,
        global_step=0,
        callback=eval_metrics_callback,
        tf_summaries=True,
        log=True,
    )

    with eval_summary_writer.as_default(), tf.compat.v2.summary.record_if(True):
        with tf.name_scope('Metrics/'):
            episodes = eval_py_env.get_stored_episodes()
            episodes = [episode for sublist in episodes for episode in sublist][:num_eval_episodes]
            metrics = episode_utils.get_metrics(episodes)
            for key in sorted(metrics.keys()):
                print(key, ':', metrics[key])
                metric_op = tf.compat.v2.summary.scalar(name=key,
                                                    data=metrics[key],
                                                    step=global_step_val)
                sess.run(metric_op)

    sess.run(eval_summary_flush_op)

where we define eval_summary_writer as follows:

    eval_summary_writer = tf.compat.v2.summary.create_file_writer(
        eval_dir, flush_millis=summaries_flush_secs * 1000)
    eval_metrics = [
        batched_py_metric.BatchedPyMetric(
            py_metrics.AverageReturnMetric,
            metric_args={'buffer_size': num_eval_episodes},
            batch_size=num_parallel_environments_eval),
        batched_py_metric.BatchedPyMetric(
            py_metrics.AverageEpisodeLengthMetric,
            metric_args={'buffer_size': num_eval_episodes},
            batch_size=num_parallel_environments_eval),
    ]
    eval_summary_flush_op = eval_summary_writer.flush()
Spenhouet commented 4 years ago

When I used TensorFlow (I have since switched to PyTorch), I saved scalars with tf.summary.scalar(name, data, step=None) as documented here: https://www.tensorflow.org/api_docs/python/tf/summary/scalar

You are using tf.compat.v2.summary.scalar, and I'm not sure how it differs. The migration guide (https://www.tensorflow.org/tensorboard/migrate) seems to contain some suggestions. Maybe just try tf.summary.scalar or tf.compat.v1.summary.scalar and see if that works?

EDIT: I'm also not familiar with the way you create a file writer, and I'm not sure what eval_metrics does. Maybe try a simple file writer like:

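from pathlib import Path
import tensorflow as tf  # TF1-style API; under TF2 use tf.compat.v1.summary.FileWriter
result_dir = Path('./res')
# Pass the log directory as a string rather than a Path object.
train_writer = tf.summary.FileWriter(str(result_dir / 'train'))
eval_writer = tf.summary.FileWriter(str(result_dir / 'eval'))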

EDIT2: I'm not up to date with the changes in TensorFlow 2. Please adjust the above examples if necessary.
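
For context on the scalar-versus-tensor question: as far as I know, the TensorFlow 2 summary API (tf.summary.scalar, also reachable as tf.compat.v2.summary.scalar) stores each value as a tensor tagged for the scalars plugin, which would explain why tensorboard --inspect lists the tags under tensor. Below is a minimal sketch of reading such values back, assuming the tensorboard package is installed. read_tensor_scalars and to_rows are hypothetical helper names, not part of tensorboard-aggregator.

```python
from typing import Dict, List, Tuple

Series = Dict[str, List[Tuple[int, float]]]


def read_tensor_scalars(event_path: str) -> Series:
    """Return {tag: [(step, value), ...]} for scalar-shaped tensor summaries."""
    # Imported lazily so this module still loads where tensorboard is missing.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
    from tensorboard.util import tensor_util

    # A size guidance of 0 keeps every tensor event instead of a sampled subset.
    accumulator = EventAccumulator(event_path, size_guidance={"tensors": 0})
    accumulator.Reload()

    series: Series = {}
    for tag in accumulator.Tags()["tensors"]:
        points = []
        for event in accumulator.Tensors(tag):
            array = tensor_util.make_ndarray(event.tensor_proto)
            # Keep only single numeric values; skip larger or non-numeric tensors.
            if array.size == 1 and array.dtype.kind in "fiu":
                points.append((event.step, float(array.reshape(-1)[0])))
        series[tag] = points
    return series


def to_rows(series: Series) -> List[Tuple[str, int, float]]:
    """Flatten the series into (tag, step, value) rows sorted by tag, then step."""
    return [(tag, step, value)
            for tag in sorted(series)
            for step, value in sorted(series[tag])]
```

Calling to_rows(read_tensor_scalars(path)) on an event file such as the attached one should yield rows that can be written to CSV. The rows are sorted by step because the inspection output above reports out-of-order steps.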

pirobot commented 4 years ago

OK, thanks for the suggestions! I'll try these and see how it goes.