eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

Any plans to support distributed TF? #17

Closed. MLnick closed this issue 6 years ago.

MLnick commented 7 years ago

E.g., setting up ClusterSpec and related code. I guess there is quite a bit of work involved, but what are the basic steps for exposing new functionality from the various TF C classes? I'm not actually certain whether that functionality is part of the officially exposed C API or not.

eaplatanios commented 7 years ago

If I'm not mistaken, that functionality is not exposed through the C API and is implemented on the Python side in the official TensorFlow API. If that's indeed the case, then we'd have to reimplement that functionality in Scala, but I haven't looked into that yet.

eaplatanios commented 6 years ago

@MLnick I've been working on adding support for distributed training, and I'm going to release some code soon that works similarly to the TF Estimators API, but is a bit more structured and strongly typed where I saw fit. I have no way of testing the distributed training API, so if you're indeed interested in it, I would really appreciate some help with testing.

Also, could you please provide an example of how you currently do distributed training so I can figure out if I already support that scenario or not?

Thanks! :)

MLnick commented 6 years ago

@eaplatanios I'd love to help test things out. I haven't done too much with distributed TF, but I'd like to be able to try (a) data-parallel and (b) "data and model" parallel training, both with pure tensorflow_scala and embedded within, say, Spark.

Do you have a branch or other code to point to here?

eaplatanios commented 6 years ago

@MLnick I'm really sorry for not responding to this earlier. I'll soon push the final changes so that distributed TF is supported similarly to the Python API. I guess the best way to go about this would be to work on the data-parallel example first. Could you provide me with a Python API-based example, so we can see whether it's supported by my API?

MLnick commented 6 years ago

I'd suggest we start by re-creating the data-parallel example in the TensorFlow docs.

eaplatanios commented 6 years ago

That sounds like a good idea. I pushed some changes that add support for (I think) everything needed to re-create that tutorial example. You can find Server at org.platanios.tensorflow.core.distributed.Server, ClusterConfig at org.platanios.tensorflow.config.ClusterConfig, and Estimator at org.platanios.tensorflow.api.learn.Estimator (I'll change the namespaces later on, after we confirm it's working, to make them accessible through tf.*). I don't have time to implement the example now because I need to focus on something else, but would you like to give it a shot? I'd be happy to assist you, but I unfortunately cannot devote too much time to this this week.
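
For reference, re-creating the tutorial's cluster setup with those classes might look roughly like the sketch below. The JobConfig helper and the exact ClusterConfig/Server signatures are assumptions modeled on the Python tf.train.ClusterSpec/tf.train.Server tutorial, so treat this as pseudocode against the namespaces mentioned above:

```scala
// Hypothetical sketch; constructor signatures are assumed, not confirmed.
import org.platanios.tensorflow.config.{ClusterConfig, JobConfig}
import org.platanios.tensorflow.core.distributed.Server

object DistributedHelloWorld {
  def main(args: Array[String]): Unit = {
    // One parameter-server task and two worker tasks, mirroring the
    // Python tutorial's ClusterSpec.
    val cluster = ClusterConfig(Map(
      "ps"     -> JobConfig.fromAddresses("localhost:2222"),
      "worker" -> JobConfig.fromAddresses("localhost:2223", "localhost:2224")))
    // Start the in-process server for this task and block until it stops.
    val server = Server(cluster, jobName = "worker", taskIndex = 0)
    server.join()
  }
}
```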

debasish83 commented 6 years ago

@eaplatanios @MLnick Regarding distributed.Server: are you planning to push it back to the Java API, or will the Scala and Java APIs continue to have their own JNI bindings?

debasish83 commented 6 years ago

I am going to try the Server API out and see if I can open 2 PS tasks and a set of worker nodes to run distributed training. But I still feel the core API should be Java, so that all JVM languages can have access to it.

debasish83 commented 6 years ago

Also, are forward prop and backward prop exposed in the C API? I believe the Python wrappers call the C++ code directly.

eaplatanios commented 6 years ago

@debasish83 The C API does not expose much other than the very core functionality of creating single ops and executing them. I have implemented the core backprop iteration in the org.platanios.tensorflow.api.ops.Gradients object. Then, the gradient of each op is defined in the respective object (e.g., in org.platanios.tensorflow.api.ops.Basic.Gradients).
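
As a concrete illustration, a gradient computation in the Scala API might look like the following. This is a minimal sketch: it assumes tf.gradients is the public entry point to the Gradients object and that the placeholder and arithmetic signatures match the project README.

```scala
import org.platanios.tensorflow.api._

// Build a tiny symbolic graph: y = x * x.
val x = tf.placeholder[Float](Shape())
val y = x * x
// Request dy/dx. The backprop traversal happens in
// org.platanios.tensorflow.api.ops.Gradients, and each op's gradient is
// looked up in its respective object (e.g., Basic.Gradients).
val grads = tf.gradients(Seq(y), Seq(x))
```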

I would be very curious to see how your attempt at distributed training goes, as I haven't gotten to test that yet. Let me know how it goes and if any problems come up. It would be great if you could contribute a small example script for your experiment once you're done. :)

eaplatanios commented 6 years ago

@debasish83 I agree with respect to the Java API, but I only need the Scala API currently and I don't have much free time to work on the Java side for now, given my research constraints. I'd be more than happy to assist with developing a Java interface to my API though. :)

debasish83 commented 6 years ago

@eaplatanios I am going through the gradient calculation code, but it's better if the logic is not duplicated in multiple places, so that we don't need to fix bugs in multiple places. The Python gradient code is moving out to the C++ API, and if we can call the C/C++ API from Scala, that would be much cleaner. Do we really have to expose all these details in Scala? I take a DataFrame, convert it to TFRecord, and after that I formulate the problem. We need to expose the graph construction, forward, backward, and optimizer APIs (I'm not sure how PS tasks are being exposed), but these should be minimal.

debasish83 commented 6 years ago

Don't we want to run all the heavy compute in C++? Take NN.scala from api/ops, for example: is the plan to expose all of this compute on the JVM? Native C++ compute should be more efficient.

debasish83 commented 6 years ago

Is it possible to add tensorflow as a git submodule and run javah against the header files from the module? The submodule can be updated with each release; a copy of the header files will be difficult to maintain. Let me know and I can take a stab at it. In the build, I would specify, for example, a set of header files to be used, and the rest of the flow stays similar.

eaplatanios commented 6 years ago

@debasish83 Regarding the gradient computation code, I do support using the TensorFlow C++ API through the ccGradients method. However, note that a lot of ops are not currently supported for gradient computation in the official TensorFlow C++ API, and that's why I have my own implementation of the back-propagation code.

Also, regarding the heavy compute: no heavy computation is done on the Scala side. Everything you see is constructing a symbolic graph that gets executed by the native TensorFlow executor at a later point in time (e.g., using a Session).
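
To make that separation concrete, here is an illustrative sketch (signatures assumed from the project README): the Scala lines only assemble graph nodes, and the multiplication itself runs inside the native executor when Session.run is called.

```scala
import org.platanios.tensorflow.api._

val a = tf.placeholder[Float](Shape(-1))
val b = a * 2.0f  // graph construction only; nothing is computed here
val session = Session()
// The actual compute happens in native code during this call.
val result = session.run(
  feeds = Map(a -> Tensor(1.0f, 2.0f, 3.0f)),
  fetches = b)
```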

Regarding the module comment, I'm not sure what you mean, but if you want to clarify or give it a try, I'm open to it. The header files I'm currently using are the same as those that would be generated by javah, though.

debasish83 commented 6 years ago

For the module, it would be nice to add tensorflow master as a git submodule in your project and then use the header files directly from tensorflow_scala/tensorflow/core/..... Right now, I think the headers from tensorflow master are getting copy-pasted, which will change over releases. The idea is similar to sparkontensorflow, where they added tensorflow as a submodule and exposed an API over it.

eaplatanios commented 6 years ago

@debasish83 I see what you mean now. The problem is that there is no single directory with all the headers in the main tensorflow repository, so we would have to write a script to collect all the *.h files anyway. Also, one file is modified, which would need to be added to that script as a hack. I agree that it would be nice to have an automated way to do this, but we need to find an elegant way to do so. I may work on this during the holiday break, because I have a paper deadline on the 15th of December that I'm working on now, and I'll be traveling for NIPS next week.

debasish83 commented 6 years ago

javah can run in the same folder where the header files are, but the output of javah can be pushed to src/native/generated, where we link it with the tensorflow library to compile the JNI. That way your library is always in sync with one version of the tensorflow headers; when new releases come, we pull the submodule. It's not a big change, but I wanted to get your feedback.
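
Sketched as an sbt fragment, that flow could look roughly like this. The task wiring and paths are purely illustrative, not a working build definition:

```scala
// Hypothetical sbt task: run javah over the compiled JNI classes and
// emit the generated headers into src/native/generated, alongside the
// headers pulled in through a tensorflow git submodule.
val generateJniHeaders = taskKey[Unit]("Generates JNI headers via javah.")

generateJniHeaders := {
  import scala.sys.process._
  val classpath = (fullClasspath in Compile).value.map(_.data).mkString(":")
  Seq(
    "javah", "-d", "src/native/generated",
    "-classpath", classpath,
    "org.platanios.tensorflow.jni.TensorFlow").!
}
```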

It looks like the official code does not call ccGradients; that code is implemented in Python. Over time, hopefully both Scala and Python can call ccGradients. Another question I have is whether the RendezvousMgr.send/recv API is exposed in tensorflow_scala. I suspect the underlying TensorFlow C++ code calls it, and perhaps it's not exposed at the API layer, but to create a Variable the API should get exposed somewhere.

eaplatanios commented 6 years ago

@debasish83 This could be useful and if you're willing to work on it I'd be happy to help (regarding the headers).

Regarding the ccGradients function call: I think it'll take a long time until that API supports gradients for the arbitrary ops that I use in my code. Currently, my implementation is much more generic and supports most of the ops I expose through my API, including control flow ops. If the CC gradients functionality is ever extended, I already support calling it through my gradients module, so no changes would need to be made other than renaming that function so that all calls to gradients default to it. The RendezvousMgr.send/recv API is not exposed because you should never need to use it. It's used by the TF runtime, and as long as you specify which devices you want your ops to execute on, the necessary send/recv ops will be created for you automatically.
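
For example, pinning ops to devices might look like the sketch below (the tf.createWith usage is assumed from the API; the variable and matmul signatures are illustrative). The runtime then wires up the send/recv ops between tasks on its own:

```scala
import org.platanios.tensorflow.api._

// Place the variable on the parameter-server task.
val w = tf.createWith(device = "/job:ps/task:0") {
  tf.variable[Float]("w", Shape(10, 10), tf.ZerosInitializer)
}
// Consume it from a worker task; the runtime automatically inserts the
// send/recv ops between /job:ps and /job:worker, so RendezvousMgr is
// never touched directly.
val y = tf.createWith(device = "/job:worker/task:0") {
  tf.matmul(w.value, w.value)
}
```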

eaplatanios commented 6 years ago

I'm closing this due to inactivity. I have already started porting the new distribute API that was recently added to the TensorFlow Python API. It's located in org.platanios.tensorflow.api.ops.training.distribute.

debasish83 commented 6 years ago

@eaplatanios I have moved to the Python API and am integrating PySpark with the TensorFlow API. If core logic like the gradient calculation moves from Python to the C++ API, that should help the Scala/Java APIs as well.

eaplatanios commented 6 years ago

True, although we already support probably all the gradients you'd need in the Scala API.