Open · hixiaye opened this issue 5 years ago
Hi @sxs11, I also want to get multi-GPU training working. Can you share your code? Maybe we can help each other.
I made these replacements:

- `tf.contrib.tpu.TPUEstimatorSpec()` -> `tf.estimator.EstimatorSpec()`
- `tf.contrib.tpu.RunConfig()` -> `tf.estimator.RunConfig()`
- `tf.contrib.tpu.TPUEstimator()` -> `tf.estimator.Estimator()`

Other points: I deleted the flags `use_tpu`, `tpu`, `gcp_project`, and `tpu_zone`, and set the `data_dir` default to None (I just use the fake data for debugging).
I use `MirroredStrategy()` for multi-GPU:

```python
NUM_GPUS = 2
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
gpu_options = tf.GPUOptions(allow_growth=True)
session_config = tf.ConfigProto(gpu_options=gpu_options)
```

`distribution` and `session_config` are passed as arguments to `tf.estimator.RunConfig()`.
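Roughly, the wiring looks like this in TF 1.x. This is only a sketch: `nas_model_fn`, `train_input_fn`, the `model_dir`, and the `params` values are placeholders rather than the repo's actual names, and in the versions I've used the `RunConfig` argument is called `train_distribute`:

```python
import tensorflow as tf

NUM_GPUS = 2
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
gpu_options = tf.GPUOptions(allow_growth=True)   # allocate GPU memory on demand
session_config = tf.ConfigProto(gpu_options=gpu_options)

# train_distribute tells the Estimator to replicate each training step across
# the GPUs; session_config is forwarded to every tf.Session it creates.
run_config = tf.estimator.RunConfig(
    model_dir='/tmp/nas_search',      # placeholder
    train_distribute=distribution,
    session_config=session_config)

estimator = tf.estimator.Estimator(
    model_fn=nas_model_fn,            # placeholder: your converted model_fn
    config=run_config,
    params={'batch_size': 128})       # placeholder hyperparameters

estimator.train(input_fn=train_input_fn, max_steps=1000)  # placeholder input_fn
```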
I solved it by setting the `moving_average_decay` flag's default to 0. It seems that moving_average_decay is not compatible with multi-GPU training; a sketch of the resulting guard is below.
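A minimal sketch of what that change amounts to inside the model function, assuming the EMA is built the way the upstream TPU reference models do it (the `params` key, the `moving_vars` collection, and the variable names here are illustrative; `train_op` is the existing training op):

```python
import tensorflow as tf

# Illustrative guard: skip the ExponentialMovingAverage entirely when the
# decay is 0. ema.apply() colocates each shadow variable with its source
# variable, and that colocation is what trips the assertion failure seen
# with MirroredVariables under MirroredStrategy.
if params['moving_average_decay'] > 0:          # params key is illustrative
    global_step = tf.train.get_global_step()
    ema = tf.train.ExponentialMovingAverage(
        decay=params['moving_average_decay'], num_updates=global_step)
    ema_vars = tf.trainable_variables() + tf.get_collection('moving_vars')
    with tf.control_dependencies([train_op]):
        train_op = tf.group(ema.apply(ema_vars))
```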
@sxs11 @fabbrimatteo Hi, I replaced `tf.contrib.tpu.TPUEstimatorSpec` with `tf.estimator.EstimatorSpec`, but I found that the latter does not have the `host_call` parameter. How can I handle this? Many thanks.
Hello, I ran into this problem when trying to reproduce this work. Can you share your code? It would be much appreciated! My email is queene_tam@163.com. Thanks a lot!
Hello, I found a way to solve this problem. Reading the source code of `TPUEstimatorSpec`, I noticed that it has a method `as_estimator_spec`, so you only need to make the following modification and it will work on GPUs (note that `as_estimator_spec` is a method and must be called):

```python
def model_fn(features, labels, mode, params):
    ...
    spec = tf.contrib.tpu.TPUEstimatorSpec(
        ...
        host_call=host_call,
        ...
    )
    return spec.as_estimator_spec()  # call the method, not the attribute
```
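If I'm reading the TF 1.x `tpu_estimator.py` source correctly (my recollection, not something stated in this thread), `as_estimator_spec()` rebuilds a plain `tf.estimator.EstimatorSpec` from the same fields and replays the `host_call` on the CPU via a training hook, so summaries written inside the host_call should still show up when running on GPUs.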
Hi @dstamoulis, thanks for your code! I have changed the TPU settings to GPU ones, like `tf.estimator.Estimator` and `tf.estimator.RunConfig`, and a single GPU works. However, when I apply `MirroredStrategy` in `tf.estimator.RunConfig` for multi-GPU, it does not work. The error is:

```
I0514 20:11:40.999713 139768726693632 tf_logging.py:115] Error reported to Coordinator:
Traceback (most recent call last):
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 783, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1168, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/data/project/tensorflow/FACE/SinglePath_NAS/single-path-nas-master_multi_gpus/nas-search/search_main.py", line 361, in nas_model_fn
    train_op = ema.apply(ema_vars)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 431, in apply
    self._averages[var], var, decay, zero_debias=zero_debias))
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 84, in assign_moving_average
    with ops.colocate_with(variable):
  File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4092, in _colocate_with_for_gradient
    with self.colocate_with(op, ignore_existing):
  File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4144, in colocate_with
    op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1305, in internal_convert_to_tensor_or_indexed_slices
    value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1144, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 447, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError
```
Any help would be appreciated, thank you!