Open animebing opened 7 years ago
Hi @animebing,
I'm not sure what the exact error is, but you should definitely make sure the line:
(image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) = data_queue.dequeue()
is outside the GPU loop. In addition, you should perform tf.split on the dequeued data according to the batch size. For each GPU you need to define the model with its corresponding share of the data.
Hope that helps - let me know if you have any trouble
Amir
@amirbar Thank you for your suggestion. I have some questions about it:
Why should data_queue.dequeue() be outside the GPU loop? data_queue holds the information for just one image, not a batch of several images, so tf.split is not necessary.
Sorry, you are right. data_queue.dequeue() should create a different dequeue op for every GPU. I initially assumed this op was shared among all GPUs.
Do you have any more information on the 3rd error? maybe a line number?
@amirbar The traceback is below
File "train/train.py", line 293, in <module>
train()
File "train/train.py", line 176, in train
loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])
File "train/../libs/nets/pyramid_network.py", line 531, in build
is_training=is_training, gt_boxes=gt_boxes)
File "train/../libs/nets/pyramid_network.py", line 278, in build_heads
refine = slim.flatten(cropped_regions)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1226, in flatten
outputs = array_ops.reshape(inputs, flat_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2510, in reshape
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 827904 values, but the requested shape has 45528058422
From the above, the error seems to come from refine = slim.flatten(cropped_regions), which is in https://github.com/CharlesShang/FastMaskRCNN/blob/master/libs/nets/pyramid_network.py#L273. Because I have some print statements in my running code, the line number differs from the one in the link above.
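For readers following along, the reshape error above is an element-count mismatch: flatten keeps the first dimension and collapses the rest, and the reshape fails when the number of values in the tensor does not match the number implied by the requested shape. Here is a minimal framework-free sketch of that check. The helper names and the example shape `[66, 7, 7, 256]` are assumptions for illustration only (that shape merely happens to contain 827904 values); the traceback does not state the actual shape of cropped_regions.

```python
from functools import reduce
from operator import mul


def num_elements(shape):
    """Total number of values in a tensor with the given shape."""
    return reduce(mul, shape, 1)


def flat_shape(shape):
    """Shape after keeping the first (batch) dimension and collapsing the rest."""
    return [shape[0], num_elements(shape[1:])]


def can_reshape(input_shape, requested_shape):
    """A reshape is only valid when both shapes hold the same number of values."""
    return num_elements(input_shape) == num_elements(requested_shape)


# Hypothetical shape, chosen only because it has 827904 values like the tensor in the error.
crops = [66, 7, 7, 256]

consistent = can_reshape(crops, flat_shape(crops))   # flatten target built from the same shape
mismatched = can_reshape(crops, [45528058422])       # element count reported in the error
print(num_elements(crops), flat_shape(crops), consistent, mismatched)
```

If the requested count is far larger than the real one, as here, it may be worth checking whether the shape used to build the reshape target comes from a different tensor or tower than the input itself.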
@animebing, did you manage to get multi-GPU working?
@simaoh, it still doesn't work.
@animebing I had the same problem as you. I checked the input shapes again and again but found nothing. It can run several times in the average_gradients function loop before failing, but I don't know why.
I am trying to modify the current code to run on multiple GPUs, based on the TensorFlow CIFAR-10 multi-GPU implementation. It looks simple in the CIFAR-10 example, but after modifying the code and running it, I get several different errors. One thing I want to point out is that the modified code works well on a single GPU (I mean no error occurs).
In train/train.py, I changed the function train to the version below (some unchanged parts are not shown here).
Besides the above, I added a new function, average_gradients, which is below.
When I run the modified code on 2 TITAN X GPUs, just after one iteration, the different errors below occur.
@CharlesShang Do you have any idea about these errors? Thank you.
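Since the average_gradients code itself is not shown here, the following is a minimal framework-free sketch of the bookkeeping such a function usually does: group the (gradient, variable) pairs that belong to the same variable across towers, then average the gradients. It is not the poster's code and not the CIFAR-10 implementation. Gradients are plain lists of floats, variables are represented by name strings, and skipping None gradients is an assumption added for illustration.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads: one list per GPU tower; each is a list of
    (gradient, variable_name) pairs in the same variable order.
    """
    averaged = []
    # zip(*tower_grads) groups the pairs that belong to the same variable.
    for grads_and_vars in zip(*tower_grads):
        var = grads_and_vars[0][1]
        # Assumption: a tower may report no gradient (None) for a variable.
        grads = [g for g, _ in grads_and_vars if g is not None]
        if not grads:
            averaged.append((None, var))
            continue
        n = len(grads)
        # Element-wise mean over the towers that produced a gradient.
        mean = [sum(vals) / n for vals in zip(*grads)]
        averaged.append((mean, var))
    return averaged


# Two hypothetical towers with gradients for variables "w" and "b".
tower_0 = [([1.0, 2.0], "w"), ([0.5], "b")]
tower_1 = [([3.0, 4.0], "w"), ([1.5], "b")]
result = average_gradients([tower_0, tower_1])
print(result)
```

The grouping only works if every tower lists its variables in the same order and with matching gradient shapes, so a shape that differs between towers would surface at this step.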