Spenhouet / tensorboard-aggregator

Aggregate multiple tensorboard runs to new summary or csv files
MIT License
166 stars · 27 forks

[Feature] Support variable step counts for a single scalar #1

Closed Chazzz closed 5 years ago

Chazzz commented 5 years ago

I assume you made this work for specifically your project's file system, but it's a good enough idea that I'll probably make it work for mine.

```
Traceback (most recent call last):
  File "../tensorboard_aggregator/aggregator.py", line 117, in <module>
    aggregate(args.path, args.output, args.subpaths)
  File "../tensorboard_aggregator/aggregator.py", line 85, in aggregate
    aggregations_per_key = {key: op(values, axis=0) for key, values in values_per_key.items()}
  File "../tensorboard_aggregator/aggregator.py", line 85, in <dictcomp>
    aggregations_per_key = {key: op(values, axis=0) for key, values in values_per_key.items()}
  File "/home/vincent/.virtualenvs/py3env/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 3118, in mean
    out=out, **kwargs)
  File "/home/vincent/.virtualenvs/py3env/lib/python3.6/site-packages/numpy/core/_methods.py", line 87, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'list' and 'int'
```

Spenhouet commented 5 years ago

No, the code was generalized and should work for everyone. If that is not the case, then I'm very interested in further improving this.

The error you posted seems to happen at the actual aggregation step (aggregating all runs for every step for a single scalar).

Could you please share further details:

Chazzz commented 5 years ago

```
python ../tensorboard_aggregator/aggregator.py --path /tmp/dorecon --subpaths "['.']"
```

/tmp/dorecon has two folders, run1 and run2, each with an event logfile, but they are full of miscellaneous other files, including subfolders with more files. Train and test share the same summary writer but use separate variable scopes so that they don't get lumped together.

OS is Ubuntu 18.10, Python 3.

```
Name: numpy
Version: 1.16.0
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: None
License: BSD
Location: /home/<username>/.virtualenvs/py3env/lib/python3.6/site-packages
Requires: 
Required-by: tensorflow-gpu, tensorboard, scipy, pandas, opencv-python, matplotlib, Keras-Preprocessing, Keras-Applications, h5py, gym, atari-py
---
Name: tensorboard
Version: 1.12.2
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /home/<username>/.virtualenvs/py3env/lib/python3.6/site-packages
Requires: wheel, six, markdown, werkzeug, grpcio, protobuf, numpy
Required-by: tensorflow-gpu
---
Name: pandas
Version: 0.24.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: http://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /home/<username>/.virtualenvs/py3env/lib/python3.6/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: 
---
Name: tensorflow-gpu
Version: 1.12.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /home/<username>/.virtualenvs/py3env/lib/python3.6/site-packages
Requires: absl-py, keras-applications, numpy, wheel, termcolor, six, protobuf, grpcio, keras-preprocessing, tensorboard, astor, gast
Required-by: 
```
Spenhouet commented 5 years ago

The current implementation assumed that there would always be a subpath for test, validation, training, etc. runs. Having no subpath, as in your case (`--subpaths "['.']"`), wasn't supported.

I have now added support for this. In addition, I added support for the case where a folder contains multiple event files (where that folder is the base path and its event files should be aggregated).

Until now, the implementation also assumed that a subpath would contain only event files and no other types of files. Non-event files are now ignored, which should make other use cases more robust.
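A minimal sketch of that filtering idea (not the project's actual code): TensorFlow summary writers name their log files with the `events.out.tfevents` prefix, so everything else in a folder can be skipped.

```python
import os

def list_event_files(folder):
    """Return only TensorFlow event files, ignoring any other files."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.startswith("events.out.tfevents")
    )
```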

I couldn't directly reproduce the error you got (I got a different error message for your case). Could you please download/pull the updated version and check whether your error still appears or confirm that it is solved?

Chazzz commented 5 years ago

I still get the same error, unfortunately, though you probably fixed the second and third errors I would have run into if this one hadn't happened.

If you're interested in reproducing the error, these might help:

- folder 1 event files
- folder 2 event files

Spenhouet commented 5 years ago

There are two issues here.

First, the one that you reported. I could now reproduce it. The problem stems from a somehow corrupted run: the scalar Losses/ReconLoss has different step counts in the two runs. One run was 516 steps long and the other 521; one run ended after 277500 iterations and the other after 280000. These steps need to match, or else it is not possible to aggregate over the missing portion of steps. That is where the error comes from.

The other issue is that even if that run weren't corrupted, it currently wouldn't work. The current implementation assumes that all scalars share the same step counts, but in your event files most scalars have 11 steps while this one scalar has 521. I'm working on supporting different step counts for different scalars, so that only the steps of a single scalar need to match across runs.
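The per-scalar constraint can be sketched as a small check (names are hypothetical, not the aggregator's actual structures): for each scalar tag, every run must have logged the same sequence of steps before the values can be aggregated element-wise.

```python
def find_step_mismatches(steps_per_run):
    """steps_per_run: {run_name: {tag: [step, ...]}} -> tags whose step lists differ."""
    runs = list(steps_per_run.values())
    mismatched = []
    for tag in runs[0]:
        step_lists = [run[tag] for run in runs]
        if any(steps != step_lists[0] for steps in step_lists[1:]):
            mismatched.append(tag)
    return mismatched

# Mirrors the situation above: one run logged 521 steps for
# Losses/ReconLoss, the other only 516, while other scalars agree.
runs = {
    "run1": {"Losses/ReconLoss": list(range(521)), "Accuracy": list(range(11))},
    "run2": {"Losses/ReconLoss": list(range(516)), "Accuracy": list(range(11))},
}
bad_tags = find_step_mismatches(runs)  # → ['Losses/ReconLoss']
```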

@Chazzz For your failed run (events.out.tfevents.1549493) you will need to rerun it to get a valid run. Once I have implemented support for different step counts for different scalars, it should then work for you.

Spenhouet commented 5 years ago

That was a relatively big change. If you fixed your run issue, please let me know if it works with the updated version.

Chazzz commented 5 years ago

@Spenhouet Thanks for pointing out the 516 vs 521 mismatch (in addition to the various features and bugfixes, which are :+1: :+1:). Unfortunately it's not corruption; the experiment is designed to run a number of variable-length episodes until it reaches a minimum iteration count per epoch. As a consequence, a consistent number of steps isn't guaranteed, and the difference is negligible except in this instance (we're talking a 0.1% or smaller difference in longer runs). In this scenario, I would prefer to unify the data and jettison non-core features as needed to accomplish that task.

Edit: To clarify, making the numbers match is trivial, but would negatively affect the quality of the model.

Spenhouet commented 5 years ago

I currently see no way to support this special case; I see many possible problems with workaround implementations.

What would you propose as a solution?

P.S. Issue title: I don't see this as "fails catastrophically". Having different step counts per scalar is simply not supported (for good reasons). If you now run with different step counts, you get a polite error message that the step counts don't match. There is no failure, and it is certainly not catastrophic. While there were multiple improvements along the way (not bug fixes), your original issue here is a feature request. I changed the title so it better reflects this.

Chazzz commented 5 years ago

Hi @Spenhouet,

It is possible that 8b3b603 (specifically 40:42) made the error message less bizarre; I did not rerun the code to confirm, but now that you mention it, you probably already fixed the "catastrophic" component. Can you please elaborate on the possible problems you see with workarounds? I don't understand your codebase as well as you do and would appreciate the insights.

While not tested, my basic battle plan is to comment out the `aggregate_to_summary` op and rework `extract` and `aggregate_to_csv` to gracefully handle a sparse matrix.

Spenhouet commented 5 years ago

This project is meant for researchers and should be usable for paper publications (e.g. exporting multi-run aggregates to CSV for LaTeX usage). Allowing runs to have diverging step numbers would lead to wrong aggregates and therefore to false results in papers. I'm not inclined to make it easier to fall into traps like that. If the aggregation ran without any error message while processing corrupted runs, a user could be led to think everything is fine. For example, averaging values and wall times that belong to different steps produces wrong results. That is why I see no form of this that I want to support.

Thanks to you, this project has become more flexible with regard to other workflows. Thank you for that. For your specific requirement, you will have to find a solution yourself. I hope this project is a good starting point for you.

Chazzz commented 5 years ago

@Spenhouet Thanks for sharing your perspective. I will likely disable the features you mentioned as problematic. Depending on how many changes I end up making, I may submit a pull request, though that doesn't mean I expect you to accept it.

ahmeda14960 commented 4 years ago

@Chazzz were you able to create a workaround for this issue? I am in a similar situation where my step numbers are off by one, and I would like to use this tool.

Chazzz commented 4 years ago

@ahmeda14960 I didn't work around this issue myself. However, `extract(dpath, subpath)` returns `(values_per_key, steps, wall_times)`, which is stored in a dict that can be manipulated before the main for loop in the `aggregate` function. For example, finding the minimum step count across all paths and setting that as the steps in all dict values would ensure an equal number of steps is aggregated from each run and avoid the mismatch.
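The truncation idea above can be sketched as follows (names are hypothetical, not the aggregator's actual data structures): cutting every run's series down to the shortest common length makes the runs align step-for-step, at the cost of dropping the trailing steps of the longer runs.

```python
def truncate_to_common_steps(values_per_run):
    """values_per_run: {run_name: [value, ...]} -> same mapping, equal lengths."""
    min_len = min(len(values) for values in values_per_run.values())
    return {run: values[:min_len] for run, values in values_per_run.items()}

# One run logged one extra step; after truncation both align.
runs = {"run1": [0.9, 0.8, 0.7, 0.65], "run2": [0.95, 0.85, 0.75]}
truncated = truncate_to_common_steps(runs)  # both runs now have 3 values
```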