I can confirm that point 1 is true, by adding to the lepage example the call tf.debugging.set_log_device_placement(True) and observing that the log never places an operator on GPU:1 but always on GPU:0 (even though nvidia-smi says that the program is using memory from GPU:1).
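For reference, a minimal snippet reproducing the check; a plain matmul is used here instead of the lepage example, but the behaviour is the same:

```python
import tensorflow as tf

# Log the device on which every operation is executed
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1000, 1000))
b = tf.random.uniform((1000, 1000))
c = tf.matmul(a, b)
# On a two-GPU machine the log reports /device:GPU:0 for MatMul,
# never /device:GPU:1, unless the op is wrapped in an explicit tf.device
```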
Concerning points 2 and 3, I think the best tf-like approach is to do something like this:

```python
strategy = tf.distribute.MirroredStrategy()

@tf.function
def run():
    with strategy.scope():
        # experimental_run_v2 takes the function and its arguments separately
        strategy.experimental_run_v2(vegas, args=(lepage, dim, n_iter, ncalls))
```

i.e. using the tf.distribute API.
A possibility, using MirroredStrategy (https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy?version=stable), is to just break the integration into equal chunks, which is not very useful for distributing. If we want to do it correctly we need to implement something not very far from one of the scheduling types described here: http://jakascorner.com/blog/2016/06/omp-for-scheduling.html#the-scheduling-types, which means creating our own strategy.
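Purely as an illustration of the kind of scheduler we would need, here is a rough sketch of an OpenMP "dynamic"-like schedule over devices; integrate_chunk and the chunk size are hypothetical placeholders, not part of the code base:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

import tensorflow as tf


def integrate_chunk(device, n_events):
    """Hypothetical per-device kernel: evaluate n_events samples on `device`."""
    with tf.device(device):
        x = tf.random.uniform((n_events,))
        return tf.reduce_sum(x)  # stand-in for the real integrand evaluation


def dynamic_schedule(devices, n_calls, chunk_size=100_000):
    """OpenMP 'dynamic'-like scheduling: each device grabs the next chunk when free."""
    chunks = Queue()
    for start in range(0, n_calls, chunk_size):
        chunks.put(min(chunk_size, n_calls - start))

    partials = []

    def worker(device):
        while True:
            try:
                n = chunks.get_nowait()
            except Empty:
                return
            partials.append(integrate_chunk(device, n))

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        list(pool.map(worker, devices))
    return sum(float(p) for p in partials)


# e.g. dynamic_schedule(["/GPU:0", "/GPU:1"], n_calls=10**7)
```

A "static" schedule would instead assign each device a fixed share upfront, which is essentially what MirroredStrategy does with its equal chunks.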
There are some projects like https://github.com/horovod/horovod which may help.
Let's have a look. I've been reading more into the TensorFlow distribution strategies and it seems that only the Keras distribution is implemented, and in order to use it we would have to tie our hands way too much imho.
I think it is better if we deal with it on our own terms for now (and actually don't take it into consideration for the rest of the code) because we can always fall back to the parallel/joblib strategy.
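For reference, a minimal sketch of what the joblib fallback could look like; integrate_chunk is a hypothetical placeholder, not an existing function in the repository:

```python
import numpy as np
from joblib import Parallel, delayed


def integrate_chunk(n_events, seed):
    """Hypothetical placeholder for a self-contained per-chunk integration."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n_events)
    return x.mean()  # stand-in for the real integrand average


def joblib_integrate(n_calls, n_workers=4):
    """Split n_calls into equal chunks and evaluate them in parallel processes."""
    chunk = n_calls // n_workers
    partials = Parallel(n_jobs=n_workers)(
        delayed(integrate_chunk)(chunk, seed) for seed in range(n_workers)
    )
    return float(np.mean(partials))
```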
In view of the great shape of #17, I think we should consider the possibility to inherit from VegasFlow some extra classes which implement specific distribution techniques, such as:

- VegasFlow, default single GPU.
- TPEVegasFlow, using the ThreadPoolExecutor from concurrent.futures or whatever.
- SparkVegasFlow, using Apache Spark.
- MPIVegasFlow, using Open MPI.
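Purely as an illustration of the proposed hierarchy (the run_events/run_iteration names and the constructor signature are hypothetical, not the actual VegasFlow interface):

```python
from concurrent.futures import ThreadPoolExecutor


class VegasFlow:
    """Placeholder for the real class: default single-GPU behaviour."""

    def __init__(self, dim, n_calls, devices=("/GPU:0",)):
        self.dim = dim
        self.n_calls = n_calls
        self.devices = list(devices)

    def run_events(self, device, n_events):
        """Hypothetical hook: evaluate n_events integrand samples on one device."""
        raise NotImplementedError

    def run_iteration(self):
        return self.run_events(self.devices[0], self.n_calls)


class TPEVegasFlow(VegasFlow):
    """Spread the events of one iteration across devices with a ThreadPoolExecutor."""

    def run_iteration(self):
        # run_events must be provided by the concrete integrator
        chunk = self.n_calls // len(self.devices)
        with ThreadPoolExecutor(max_workers=len(self.devices)) as pool:
            partials = pool.map(self.run_events,
                                self.devices,
                                [chunk] * len(self.devices))
            return sum(partials)
```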
After a more careful reading of https://www.tensorflow.org/guide/gpu, and after testing some operators like matmul with large matrices, I realized that TF doesn't use all available GPUs automatically. Thus for this project we have to consider splitting n_iter or n_calls manually across the available tf.device contexts, so we may get a factor nGPUs faster. We may also consider adding CPU:0 together with the GPUs. So we need to:
vegas
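For completeness, a minimal sketch of the manual static split described above, assuming TF >= 2.1 for tf.config.list_logical_devices; the random-sampling body is a stand-in for the real integrand evaluation:

```python
import tensorflow as tf

# All visible GPUs, falling back to the CPU if none are available
devices = [d.name for d in tf.config.list_logical_devices("GPU")] or ["/CPU:0"]


def run_split(integrand, dim, n_calls):
    """Evaluate an equal share of n_calls on each device and combine the results."""
    per_device = n_calls // len(devices)
    partials = []
    for device in devices:
        with tf.device(device):
            x = tf.random.uniform((per_device, dim))
            partials.append(tf.reduce_mean(integrand(x)))
    # Note: in eager mode this loop runs sequentially; real concurrency needs
    # threads or a tf.function that launches the per-device work together.
    return tf.reduce_mean(tf.stack(partials))
```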