CharlesShang / FastMaskRCNN

Mask RCNN in TensorFlow

multi-gpu version try, but the program is always crashing #95

Open animebing opened 7 years ago

animebing commented 7 years ago

I am trying to modify the current code to run on multiple GPUs, based on the TensorFlow CIFAR-10 multi-GPU implementation. It looked simple to do from the CIFAR-10 example, but after modifying the code and running it I get several different errors. One thing I want to point out is that the modified code works fine on a single GPU (I mean no errors occur there).

In train/train.py, I changed the function train to the following (some unchanged parts are not shown here):

with tf.device("/cpu:0"):
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        image, ih, iw, gt_boxes, gt_masks, num_instances, img_id = \
            datasets.get_dataset(FLAGS.dataset_name,
                                FLAGS.dataset_split_name,
                                FLAGS.dataset_dir,
                                FLAGS.im_batch,
                                is_training=True)

        data_queue = tf.RandomShuffleQueue(capacity=32, min_after_dequeue=16,
                dtypes=(
                    image.dtype, ih.dtype, iw.dtype,
                    gt_boxes.dtype, gt_masks.dtype,
                    num_instances.dtype, img_id.dtype))
        enqueue_op = data_queue.enqueue((image, ih, iw, gt_boxes, gt_masks, num_instances, img_id))
        data_queue_runner = tf.train.QueueRunner(data_queue, [enqueue_op] * 4)
        tf.add_to_collection(tf.GraphKeys.QUEUE_RUNNERS, data_queue_runner)

        lr = _configure_learning_rate(82783, global_step)
        optimizer = _configure_optimizer(lr)

        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('tower_%d' % i) as scope:
                        (image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) =  data_queue.dequeue()

                        im_shape = tf.shape(image)
                        image = tf.reshape(image, (im_shape[0], im_shape[1], im_shape[2], 3))

                        ## network
                        logits, end_points, pyramid_map = network.get_network(FLAGS.network, image,
                                weight_decay=FLAGS.weight_decay)

                        outputs = pyramid_network.build(end_points, ih, iw, pyramid_map,
                                num_classes=81,
                                base_anchors=9,
                                is_training=True,
                                gt_boxes=gt_boxes, gt_masks=gt_masks, scope=scope,
                                loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])

                        total_loss = outputs['total_loss'] # total loss excludes regularization loss here
                        losses  = outputs['losses']
                        batch_info = outputs['batch_info']
                        regular_loss = tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

                        tf.get_variable_scope().reuse_variables()

                        grads = solve_grads(optimizer, scope)
                        tower_grads.append(grads)

        grads = average_gradients(tower_grads)
        grad_updates = optimizer.apply_gradients(grads, global_step=global_step)

        update_ops = []
        update_ops.append(grad_updates)
        update_op = tf.group(*update_ops)

        summaries.append(tf.summary.scalar('learning_rate', lr))
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.allow_soft_placement = True
        sess = tf.Session(config=config)
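
The session setup and training loop that consume update_op are not shown above; for context, the usual TF 1.x queue-runner pattern looks roughly like this (only a sketch, not the exact code from train.py; FLAGS.max_iters is a placeholder name):

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())
sess.run(init_op)

# start the QueueRunner registered in tf.GraphKeys.QUEUE_RUNNERS above,
# so data_queue keeps being filled while the towers dequeue from it
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    for step in range(FLAGS.max_iters):  # placeholder iteration count
        _, loss_val = sess.run([update_op, total_loss])
        if step % 100 == 0:
            print('iter %d: total_loss = %f' % (step, loss_val))
finally:
    coord.request_stop()
    coord.join(threads)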

Besides the above, I added a new function average_gradients, shown below:

def average_gradients(tower_grads):
    """Calculate the average gradient for each shared variable across all towers.
    Note that this function provides a synchronization point across all towers.
    Args:
        tower_grads: List of lists of (gradient, variable) tuples. The outer list
        is over individual gradients. The inner list is over the gradient
        calculation for each tower.
    Returns:
        List of pairs of (gradient, variable) where the gradient has been averaged
        across all towers.
    """
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = []
        for g, tmp in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            if g is None:  # None is from the bias of last class prediction layer, which is not included in loss
                break
            expanded_g = tf.expand_dims(g, 0)

            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)

        if not grads:
            continue

        # Average over the 'tower' dimension.
        grad = tf.concat(axis=0, values=grads)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads

When I run the modified code on 2 TITAN X GPUs, just after one iteration I get different errors, listed below:

  1. failed to enqueue async memset operation: CUDA_ERROR_INVALID_HANDLE
  2. failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_INTERNAL_ERROR, Internal: Blas GEMM launch failed
  3. Input to reshape is a tensor with 827904 values, but the requested shape has 0

@CharlesShang Do you have any idea about these errors? Thank you.

amirbar commented 7 years ago

Hi @animebing,

I'm not sure what the exact error is, but you should definitely make sure the line:

(image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) =  data_queue.dequeue()

is outside the GPU loop. In addition, you should perform tf.split on the dequeued data according to the batch size, and for each GPU define the model on its corresponding share of the data, as in the sketch below.
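
Roughly what I have in mind (only a sketch; it assumes the dequeued tensors are batched along dimension 0, and names like images_batch and build_tower are made up here):

images_batch, boxes_batch = data_queue.dequeue()  # one dequeue, outside the GPU loop
image_splits = tf.split(images_batch, FLAGS.num_gpus, axis=0)
boxes_splits = tf.split(boxes_batch, FLAGS.num_gpus, axis=0)

for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('tower_%d' % i) as scope:
            # each tower builds the model on its own share of the batch
            build_tower(image_splits[i], boxes_splits[i], scope)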

Hope that helps - let me know if you have any trouble

Amir

animebing commented 7 years ago

@amirbar Thank you for your suggestion. I have some questions about it:

  1. Why should data_queue.dequeue() be outside the GPU loop?
  2. As I understand it, each element in data_queue holds the information for a single image, not a batch with more than one image, so tf.split is not necessary.
  3. For each GPU I dequeue one element, so the data on each GPU is already different; that is why I can't understand "for each GPU define the model on its corresponding share of the data".

amirbar commented 7 years ago

Sorry, you are right.

data_queue.dequeue()

should create a different dequeue op for every GPU, as your code already does. I initially assumed this op was shared among all GPUs; the difference is sketched below.
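
To be explicit about the difference (sketch):

# what your code does: a distinct dequeue op per tower, so each GPU sees different data
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        tower_data = data_queue.dequeue()
        # ... build the tower on tower_data ...

# what I had assumed: a single dequeue op created outside the loop,
# so every tower would consume the same tensors
shared_data = data_queue.dequeue()
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        pass  # ... build the tower on shared_data ...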

Do you have any more information on the 3rd error? Maybe a line number?

animebing commented 7 years ago

@amirbar The traceback is below

File "train/train.py", line 293, in <module>
    train()
  File "train/train.py", line 176, in train
    loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])
  File "train/../libs/nets/pyramid_network.py", line 531, in build
    is_training=is_training, gt_boxes=gt_boxes)
  File "train/../libs/nets/pyramid_network.py", line 278, in build_heads
    refine = slim.flatten(cropped_regions)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1226, in flatten
    outputs = array_ops.reshape(inputs, flat_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2510, in reshape
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 827904 values, but the requested shape has 45528058422

From the above, the error seems to come from refine = slim.flatten(cropped_regions), which is at https://github.com/CharlesShang/FastMaskRCNN/blob/master/libs/nets/pyramid_network.py#L273; I have added some print statements to my running code, so the line numbers differ from those in the link.
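
A quick check that might narrow this down is to compare the static graph-time shape with the runtime shape of cropped_regions right before the flatten (just a debugging sketch around the tensor from build_heads):

# debugging sketch: see whether the static shape is already wrong at graph
# construction time, or whether only the runtime shape is off
print('static shape of cropped_regions:', cropped_regions.get_shape().as_list())
cropped_regions = tf.Print(cropped_regions, [tf.shape(cropped_regions)],
                           message='runtime shape of cropped_regions: ')
refine = slim.flatten(cropped_regions)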

simaoh commented 7 years ago

@animebing, did you manage to get multi-GPU to work?

animebing commented 7 years ago

@simaoh, it still doesn't work.

VisanXJ commented 6 years ago

@animebing I had the same problem as you. I checked the input shapes again and again but found nothing. It gets through the loop in average_gradients several times before failing, but I don't know why.