databricks / tensorframes

[DEPRECATED] Tensorflow wrapper for DataFrames on Apache Spark
Apache License 2.0

How to combine distributed Tensorflow with TensorFrames? #67

Open DjangoPeng opened 7 years ago

DjangoPeng commented 7 years ago

As we all know, TensorFlow has supported built-in distributed training since v0.8, but I can't find any work on combining it with TensorFrames. So, could you share your plan for that?

thunterdb commented 7 years ago

@DjangoPeng this is a good question. Since the computations run by TensorFlow are more general than the communication model followed by Spark, the plan is to enter the TensorFlow job, let TF take over the cluster for as long as it needs to run its (distributed) calculations, and wait for completion. This should not require particular support on the TensorFrames side, but I have not looked into the details yet.

debasish83 commented 7 years ago

@thunterdb I am exposing the distributed_training and ps interfaces through javacpp-presets (https://github.com/bytedeco/javacpp-presets/issues/398). Once they are exposed, I believe it will be possible to create one long-running job on the Spark cluster that constructs the ps nodes, and another tensorframes/sparknet job that pushes data through feed_dict to run distributed training against those ps nodes. If you think this is a feasible direction, I can add the support to tensorframes. Could you also confirm that the tensorframes Scala/Java code path does not involve any Python ser/deser, since I would like to avoid Python and use the native TensorFlow C++ code directly through JavaCPP?
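For the feed_dict half of this idea, a minimal JVM-side sketch is below. It is purely illustrative: it uses the official org.tensorflow Java API (1.x era) rather than javacpp-presets, leaves out the ps/cluster setup entirely, and the class and op names are made up for the example rather than taken from tensorframes.

```java
import org.tensorflow.DataType;
import org.tensorflow.Graph;
import org.tensorflow.Output;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

public class FeedFromJvm {
  public static void main(String[] args) {
    try (Graph g = new Graph()) {
      // y = x * x, with x fed from the JVM on every Session run,
      // playing the role of feed_dict in the Python API.
      Output<Float> x = g.opBuilder("Placeholder", "x")
          .setAttr("dtype", DataType.FLOAT)
          .build()
          .output(0);
      g.opBuilder("Mul", "y").addInput(x).addInput(x).build();

      try (Session s = new Session(g);
           Tensor<?> batch = Tensor.create(new float[] {1f, 2f, 3f});
           Tensor<?> out = s.runner().feed("x", batch).fetch("y").run().get(0)) {
        // prints [1.0, 4.0, 9.0]
        System.out.println(java.util.Arrays.toString(out.copyTo(new float[3])));
      }
    }
  }
}
```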

thunterdb commented 7 years ago

@debasish83 regarding the serialization issue, tensorframes completely bypasses Python ser/deser (Python is only used by the frontend to send the TensorFlow compute graph).

Regarding javacpp, the current version uses the hand-written official Java bindings for TensorFlow, since I found notable improvements in performance, memory usage and stability with them. Note that for research purposes, you may want to fork tensorframes and use the javacpp integration code.

saudet commented 6 years ago

@thunterdb Could you provide more details about "notable improvements in performance, memory usage and stability"? The manually written JNI bindings simply make the same calls in C/C++ that we could otherwise do from Java with JavaCPP. There shouldn't be any differences, so I would like to understand what the issues are.

saudet commented 6 years ago

The only real difference I can see is that you are not manually deallocating memory when using JavaCPP, but you are when using the Java API of TensorFlow, where it is mandatory: if you don't, it simply leaks memory. JavaCPP at least gives you the option to fall back on the garbage collector instead. Relying on the GC, however, can be unreliable and reduces performance. That is probably what is happening here.
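As a concrete illustration of the two options, here is a minimal sketch with a plain JavaCPP FloatPointer standing in for a TensorFlow object (the names are illustrative, not tensorframes code):

```java
import org.bytedeco.javacpp.FloatPointer;

public class JavaCppMemoryOptions {
  public static void main(String[] args) {
    // Option 1: explicit deallocation, comparable to the mandatory close()
    // in the official TensorFlow Java API -- native memory is released
    // deterministically.
    FloatPointer eager = new FloatPointer(1 << 20);  // ~4 MB off-heap
    try {
      eager.put(0, 42f);
    } finally {
      eager.deallocate();
    }

    // Option 2: do nothing and rely on JavaCPP's garbage-collector fallback.
    // The native block is only freed once the GC collects the Pointer object,
    // which is the unreliable/slow path discussed above and can push a Spark
    // executor past its off-heap memory limits.
    FloatPointer lazy = new FloatPointer(1 << 20);
    lazy.put(0, 42f);
    // no deallocate(): freed whenever the GC gets around to it
  }
}
```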

thunterdb commented 6 years ago

@saudet thank you for the comments. Note that my experience was based on early, unstable releases of tensorflow-javacpp, and the issues we saw back then were solved by switching to the official Java bindings from the TensorFlow project. I do not think more recent versions of tensorflow-javacpp have been tried since. Also, with Spark, controlling memory usage precisely is very desirable because multiple systems share the off-heap memory, so relying on the GC was found to cause executors to be killed with OOM errors.

saudet commented 6 years ago

That's my point though. We never had to rely on the GC with JavaCPP. It was always an option to manage the memory manually.

saudet commented 5 years ago

@thunterdb BTW, this got a lot more explicit in JavaCPP with PointerScope: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
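For reference, a minimal sketch of that pattern, assuming JavaCPP 1.4.2 or later (the variable names are illustrative): every Pointer allocated while the scope is open is attached to it and deallocated deterministically when the scope closes, so there is no per-object bookkeeping and no reliance on the GC.

```java
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.PointerScope;

public class PointerScopeExample {
  public static void main(String[] args) {
    try (PointerScope scope = new PointerScope()) {
      // Both allocations are registered with the innermost open scope.
      FloatPointer a = new FloatPointer(1024);
      FloatPointer b = new FloatPointer(1024);
      a.put(0, 1f);
      b.put(0, 2f);
    } // a and b are both deallocated here, without waiting for the GC
  }
}
```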