apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Building a visualization tool for MXNet #4003

Closed: zihaolucky closed this 6 years ago

zihaolucky commented 7 years ago

Hi hackers,

I've started working on building a visualization tool for MXNet, like TensorBoard for TensorFlow. As @piiswrong suggested in #3306, 'try to strip TensorBoard out of tensorflow', and I'm going to work in this direction. Here are some of my notes after reading TensorBoard's documentation and searching for its usage on the web; feel free to comment below.

Motivation and some backgrounds

I've tried to visualize data using matplotlib and a bunch of helper tools like t-SNE in my daily work, and I'm tired of rendering and adjusting the size/color of the images. Besides, it's not easy to share these results with my friends.

TensorBoard, meanwhile, provides good solutions for our daily use cases, such as learning curves and parameter/embedding visualization, and its results are easy to share. See TensorBoard for more.

Daily use cases

I think these would satisfy most people, and they are already supported by TensorBoard through tf.scalar_summary, tf.image_summary, tf.histogram_summary, and tensorboard.plugins.projector.

TensorBoard usage

Some snippets from a tutorial on how to use TensorBoard.

# create a summary for our cost and accuracy
tf.scalar_summary("cost", cross_entropy)
tf.scalar_summary("accuracy", accuracy)

# merge all summaries into a single "operation" which we can execute in a session 
summary_op = tf.merge_all_summaries()

with tf.Session() as sess:
    # variables need to be initialized before we can use them
    sess.run(tf.initialize_all_variables())

    # create log writer object
    writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())

    # perform training cycles
    for epoch in range(training_epochs):

        # number of batches in one epoch
        batch_count = int(mnist.train.num_examples/batch_size)

        for i in range(batch_count):
            batch_x, batch_y = mnist.train.next_batch(batch_size)

            # perform the operations we defined earlier on batch
            _, summary = sess.run([train_op, summary_op], feed_dict={x: batch_x, y_: batch_y})

            # write log
            writer.add_summary(summary, epoch * batch_count + i)

The logic above is quite clear: the accuracy and cost are updated every time sess.run is called, which returns a serialized Summary that is then fed into the log file through SummaryWriter.

Feasibility

1. Is it easy to borrow and use directly in MXNet?

I've successfully visualized a made-up curve using the code below:

import numpy as np
import tensorflow as tf

logs_path = '/tmp/tensorboard_logs'  # log directory, assumed for this example

counter = tf.Variable(1.0)

# create a summary for counter
tf.scalar_summary("counter", counter)

# merge all summaries into a single "operation" which we can execute in a session
summary_op = tf.merge_all_summaries()

with tf.Session() as sess:
    # variables need to be initialized before we can use them
    sess.run(tf.initialize_all_variables())

    # create log writer object
    writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())

    # perform training cycles
    for epoch in range(100):

        # assign a new value to the counter, then evaluate the summary op
        counter.assign(epoch + np.random.standard_normal()).eval()
        summary = sess.run(summary_op)

        # write log
        writer.add_summary(summary, epoch)
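
To view the curve, we can then point TensorBoard at the same log directory (the path assumed above):

tensorboard --logdir=/tmp/tensorboard_logs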

This means we could pass in something common, here a NumPy array or a plain int, and reuse most of the code. I would like to discuss possible routes for creating an interface to connect MXNet and TensorBoard, and I need your advice. But let's keep it simple for now.

2. Could it be stripped out on its own?

From this README, I guess TensorBoard could be built independently?

Or, if you are building from source:

bazel build tensorflow/tensorboard:tensorboard
./bazel-bin/tensorflow/tensorboard/tensorboard --logdir=path/to/logs

TODO

To prove we can use TensorBoard in a dummy way:

To keep our code clean and lightweight:

Or we could install the entire TF together with MXNet? Is that acceptable? I think it's okay, but it's not good for our users and makes this visualization tool too heavy, because we would also run core code in TensorFlow (the sess and Tensor.eval are actually computed by TF). But it depends on our checks; hard to tell.

Or is there any other way to work around this? Since the summary in writer.add_summary(summary, epoch * batch_count + i) is a protocol buffer, we could use only SummaryWriter without using TF's computation. This is possible according to the docstring of SummaryWriter.add_summary:

  def add_summary(self, summary, global_step=None):
    """Adds a `Summary` protocol buffer to the event file.

    This method wraps the provided summary in an `Event` protocol buffer
    and adds it to the event file.

    You can pass the result of evaluating any summary op, using
    [`Session.run()`](client.md#Session.run) or
    [`Tensor.eval()`](framework.md#Tensor.eval), to this
    function. Alternatively, you can pass a `tf.Summary` protocol
    buffer that you populate with your own data. The latter is
    commonly done to report evaluation results in event files.

    Args:
      summary: A `Summary` protocol buffer, optionally serialized as a string.
      global_step: Number. Optional global step value to record with the
        summary.
    """

If we decide to borrow TensorBoard:

piiswrong commented 7 years ago

@leopd @mli

piiswrong commented 7 years ago

The way tensorboard works is that it takes in a log file printed in a specific format and then renders it. So we don't necessarily need tf.scalar_summary and tf.Session to use it. We simply need to print the log in the same format and run tensorboard.

Here is what I think would be an ideal solution:

  1. we strip a minimum set of files related to tensorboard out of tensorflow and build it with a Makefile or cmake, not bazel.
  2. we modify mxnet's logging format so that it conforms to tensorboard's.

But I haven't looked into this in depth, so it might be hard/impossible. So feel free to do anything that works for you first. We can discuss whether we want to merge it into mxnet or provide it as a separate solution afterwards.

zihaolucky commented 7 years ago

Yes, tensorboard only requires the proto of the logged results, but I didn't find an entry point to create a Summary object, which is returned directly by scalar_summary (a TensorFlow op), and that means we have to call Session.run. I'm trying to work around this.

I would look into this in the coming two weeks.

tqchen commented 7 years ago

I think tensorboard is relatively isolated. Last time I saw the code, only the proto of the logger file was needed.

leopd commented 7 years ago

My memory of using tensorboard is that those logfiles quickly get extremely large. Do people really share those logfiles with each other? It also made me worry that the huge amount of I/O would limit performance -- which would be more of an issue with MXNet than TF. So that's something else we can experiment/measure: what kind of IO bandwidth would be needed to produce these logfiles.

tqchen commented 7 years ago

cf. this example of using TensorBoard in minpy:

https://github.com/dmlc/minpy/blob/6e528ceab34f114f6c486c47b5e4cd417d8c03d5/docs/tutorial/visualization_tutorial/minpy_visualization.ipynb

@jermainewang may have more comments on the details

zihaolucky commented 7 years ago

@tqchen @jermainewang Thanks for the reference. I've found an API for scalar_summary that doesn't require running TensorFlow ops here, but it still uses SummaryWriter and EventWriter for the tensorboard log file.

Although it only has scalar_summary now, it seems they're actively working on it. I've sent an email to ask Dan whether they have any plans in this direction to support more types of summaries, but I haven't gotten feedback yet.

jermainewang commented 7 years ago

Minpy's way of using tensorboard could be migrated to mxnet quite easily. There are mainly three components:

  1. Proto files: the summary proto and the event proto. These can be used directly, and the summary writer can be borrowed straight from TF's Python side.
  2. EventWriter logic. TensorFlow has an EventWriter in C++, but it can easily be rewritten in Python.
  3. RecordFileWriter logic. After serializing the event proto, TF uses recordio to write to disk. This part should be replaced by our own implementation, and that's it.

We plan to put the code here: https://github.com/dmlc/minpy/tree/visualize/minpy/visualize . I will ping you again after it is updated.
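
A rough sketch of components 1 and 2 (the proto module paths are TF 0.x's generated Python modules; the helper name is made up):

import time

from tensorflow.core.framework import summary_pb2
from tensorflow.core.util import event_pb2

def make_scalar_event(tag, value, step):
    # wrap a scalar Summary proto in an Event proto; no session or graph needed
    summary = summary_pb2.Summary(
        value=[summary_pb2.Summary.Value(tag=tag, simple_value=value)])
    return event_pb2.Event(wall_time=time.time(), step=step, summary=summary)

# the serialized bytes are what component 3 writes into the event file
event_bytes = make_scalar_event('accuracy', 0.9, step=1).SerializeToString()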

jermainewang commented 7 years ago

The related PR is still under review here: https://github.com/dmlc/minpy/pull/87

zihaolucky commented 7 years ago

@jermainewang That's great!

mufeili commented 7 years ago

Hi, I've finished the scalar summary part and am currently exploring image summaries and histogram summaries. We did not plan to do audio and graph summaries for minpy, since minpy does not use a computational graph. But those should work for mxnet.

I also realized there is a new section in TensorBoard, after the release of TensorFlow v0.12, for word embeddings, which is super cool: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/index.html#tensorboard-embedding-visualization.

zihaolucky commented 7 years ago

Hey guys, I've finished the first item in the TODOs with generous help from @mufeili and @jermainewang.

But it still requires a writer/RecordFileWriter from TF; I will submit the code once I finish the writer.

zihaolucky commented 7 years ago

@mufeili Could you take a look at this issue: https://github.com/tensorflow/tensorflow/issues/4181, in which danmane said it's 'tfrecord' that does the file-writing job? I then dug into the code and found the relevant C++ code in tensorflow/core/lib/io/record_writer.cc and py_record_writer.cc; TensorFlow uses SWIG to wrap them for use from Python.

I think it's too hard to rewrite these in Python as they have so many dependencies, and they're not easy to use from other languages, which means someone would have to use the Python interface for visualization purposes.

Can I just pull the related C++ files out, put them in the core library, and use SWIG or something else as a solution for the Python interface? @piiswrong Could you give me some suggestions? What's your convention for writing a wrapper from C++ to Python?

mufeili commented 7 years ago

@zihaolucky tensorflow/tensorflow/core/lib/io/record_writer.cc is exactly where I got stuck at first. We then decided to use tf.python_io.TFRecordWriter for the time being.

zihaolucky commented 7 years ago

@mufeili @piiswrong

Good news: I've found that someone has already provided a solution for writing the record_writer in pure Python, see https://github.com/TeamHG-Memex/tensorboard_logger

I migrated the code to MXNet and it works; we can now use TensorBoard without relying on TF. I've submitted the code to my branch https://github.com/zihaolucky/mxnet/tree/feature/tensorboard-support-experiment , please check it out.
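
For reference, the framing such a pure-Python writer has to reproduce is just TF's record format. A sketch (assuming the third-party crcmod package for CRC-32C; the masking constant comes from TF's record_writer.cc):

import struct

import crcmod.predefined

_crc32c = crcmod.predefined.mkPredefinedCrcFun('crc-32c')

def _masked_crc(data):
    # TF stores a masked CRC-32C next to each field
    crc = _crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff

def write_record(f, data):
    # each record: uint64 length, masked crc of length, data, masked crc of data
    header = struct.pack('<Q', len(data))
    f.write(header)
    f.write(struct.pack('<I', _masked_crc(header)))
    f.write(data)
    f.write(struct.pack('<I', _masked_crc(data)))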

mufeili commented 7 years ago

@zihaolucky Awesome! I've had a quick look at it. I think it currently only supports scalar summaries as well, so I am not sure whether the record_writer function would still work for other kinds of summaries. But lots of thanks anyway!

zihaolucky commented 7 years ago

@mufeili It seems it could also support other types of summaries, since it writes a serialized event; it just only exposes a scalar summary API right now.

https://github.com/TeamHG-Memex/tensorboard_logger/blob/master/tensorboard_logger/tensorboard_logger.py#L94-L103
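
For example, a hypothetical image-summary helper built from the proto fields alone (no TF ops; the function name is made up) would ride the same event path:

from tensorflow.core.framework import summary_pb2

def make_image_summary(tag, png_bytes, height, width):
    # encoded_image_string carries an already-encoded PNG; colorspace 3 = RGB
    image = summary_pb2.Summary.Image(
        height=height, width=width, colorspace=3,
        encoded_image_string=png_bytes)
    return summary_pb2.Summary(
        value=[summary_pb2.Summary.Value(tag=tag, image=image)])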

terrytangyuan commented 7 years ago

Great work - exciting to see the progress! Note that you probably need to include the necessary copyright information if you borrow the code from some other project.

zihaolucky commented 7 years ago

@terrytangyuan Thanks for your kind reminder; I will do some research on the copyright issue.

piiswrong commented 7 years ago

Since mxnet and tf are both using the Apache license, it should be fine. Retaining the author comment at the beginning of each file should be enough.


tqchen commented 7 years ago

It would be necessary to copy the LICENSE file from the original repo and retain the copyright notice.

zihaolucky commented 7 years ago

Update: we now provide a PyPI package for TensorBoard fans :)

https://github.com/dmlc/tensorboard/issues/19

bravomikekilo commented 7 years ago

I made a standalone tensorboard by extracting tensorboard's C++ dependencies from TensorFlow, so we don't need to build the whole TensorFlow now. Meanwhile, we can use the TensorFlow file system API from Python through this reduced tensorflow library: bravomikekilo/mxconsole

zihaolucky commented 7 years ago

@bravomikekilo great work! Any plan to ship it to dmlc/tensorboard? I believe you'll have to make it easy to maintain, as tensorboard might change very often and new features keep coming in (as they said at the TF Dev Summit, they're going to provide a more flexible plugin module for tensorboard developers). That's why I focus on the logging part and try not to change the rendering part.

Just my personal opinion.

bravomikekilo commented 7 years ago

I've mostly kept the structure of the tensorboard project, and I'm going to keep the structure the same as the official tensorflow so we can sync changes. I have enabled logging support from C++, so it will be much faster and more reliable.

bravomikekilo commented 7 years ago

Or maybe we should merge dmlc/tensorboard into mxconsole? Most tensorboard functionality can be enabled from the reduced tensorflow. Meanwhile, we could split mxconsole into smaller modules. The reduced tensorflow can do much more.

zihaolucky commented 7 years ago

@piiswrong @jermainewang any thoughts?

bravomikekilo commented 7 years ago

I've already merged dmlc/tensorboard into bravomikekilo/mxconsole, including a native-library-powered summary API and a tutorial. The tutorial works fine now. I will clean up the project tomorrow.

piiswrong commented 7 years ago

What's the benefit of extracting the code vs cloning tensorflow?

bravomikekilo commented 7 years ago

The library is smaller and easier to build.

bravomikekilo commented 7 years ago

Meanwhile, a smaller code base is much clearer and more portable. The reduced tensorflow (tensorflow_fs) now contains only about 300 source files, while the original TensorFlow contains about 7000. tensorflow_fs keeps the same project structure as TensorFlow, so it should be easy to sync changes.

bravomikekilo commented 7 years ago

@piiswrong

zihaolucky commented 7 years ago

A native library for potential support of more language interfaces seems like a good idea, but the maintainers would still have to write a wrapper, the same workload as writing the logging interface in Scala or any other language.

I encourage you to propose a roadmap in this direction of extracting the code and point out some promising benefits; if there are none, spending time on the 10% that differs while 90% is the same is not a good idea.

bravomikekilo commented 7 years ago

OK, I will try to add back the interface files for Go and Java from the original TensorFlow. Scala can use the Java interface easily, right? R, Julia, JS, and Matlab only have protobuf libraries, so we'd need to write the logging part ourselves. For R there is SWIG support, so we should only need to change a few SWIG files to add native support. For Julia and Matlab, we may need to use the C interface. @zihaolucky

bravomikekilo commented 7 years ago

Besides, the native library provides a faster implementation of crc32 and protobuf writing, and it is possible to merge in native PNG encoding support. And considering the differences between TensorFlow and MXNet, the graph and embedding renderers and loggers may change a lot; without a standalone tensorboard, that may be hard to achieve.

bravomikekilo commented 7 years ago

A sad story is that the Java and Go interfaces don't have summaries or anything like that; maybe they just add the summary ops to the graph. It seems all the loggers still need to be written.

zihaolucky commented 7 years ago

Consider focusing on logging.

bravomikekilo commented 7 years ago

I could extract just the logging part; that is much smaller.

bravomikekilo commented 7 years ago

Maybe we should split logging and rendering? Now the only remaining problem with rendering is just graph and embedding.

bravomikekilo commented 7 years ago

So, to sum up, we have three ways to go:

  1. fix the Python code in dmlc/tensorboard
    • easy to achieve
  2. fix the Python code in dmlc/tensorboard to use the native library from the tensorboard build
    • easy to achieve
    • with native support
  3. find a way to keep tracking changes between bravomikekilo/mxconsole and tensorflow/tensorflow

An optional choice is to split tensorflow_fs out of mxconsole; that would make it easier to keep in sync.

RogerBorras commented 7 years ago

@zihaolucky @bravomikekilo Are you planning to port Tensorboard to the mxnetR binding?? That would be great!! :)

bravomikekilo commented 7 years ago

I'm not good at R, but I will try. It shouldn't be too hard.

RogerBorras commented 7 years ago

Great, thanks a lot @bravomikekilo!

lichen11 commented 7 years ago

@bravomikekilo @zihaolucky @RogerBorras @thirdwing it would be great if there were a visualization board for mxnetR!

zihaolucky commented 7 years ago

@lichen11 @bravomikekilo If you can figure out a way to write the event file and the summary protobufs in R, then it can be achieved. Just refer to https://github.com/dmlc/tensorboard/tree/master/python and ping me if you need any help.