ValueError: saved weight shape issue

david-schweitzer commented 6 years ago

This is fantastic work, and your details on how to extend the problem to include more classes are perfect.

But I'm trying to train on 6 classes (7 including background), and I used the mask_rcnn_coco.h5 weights, and I run into the following error:

ValueError: Layer #389 (named "mrcnn_bbox_fc"), weight <tf.Variable 'mrcnn_bbox_fc/kernel:0' shape=(1024, 28) dtype=float32_ref> has shape (1024, 28), but the saved weight has shape (1024, 324).

How do I get around this? Ideally, I would like to train the model from scratch without loading weights, but I don't think I can using your code?

SUYEgit commented 6 years ago

Hi, thanks a lot for your comments.

I don't think the weights are the problem. You should be able to train with the COCO weights. This error most likely occurs when NUM_CLASSES in class SurgeryConfig has not been modified, as I mentioned at the beginning of the readme file. Please let me know if you still have the problem.

Thank you for trying out the code.
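As a sanity check on the shapes in the error message: in the Matterport implementation this repository builds on, the mrcnn_bbox_fc layer outputs four box-refinement values per class, so the second dimension of its kernel should be 4 x NUM_CLASSES. The sketch below recovers the class count implied by each shape; the helper name is illustrative and not part of the repository.

```python
def classes_from_bbox_fc(units):
    """Return the class count implied by the width of mrcnn_bbox_fc.

    Assumes the layer emits 4 box deltas per class (units = 4 * NUM_CLASSES).
    """
    if units % 4 != 0:
        raise ValueError("expected a multiple of 4, got %d" % units)
    return units // 4


for label, units in [("model being built", 28), ("saved weight file", 324)]:
    print("%s: %d units -> %d classes" % (label, units, classes_from_bbox_fc(units)))
```

Read this way, (1024, 28) matches a model configured for 7 classes (6 plus background), while (1024, 324) matches the 81 classes (80 plus background) that the COCO checkpoint was trained on.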

david-schweitzer commented 6 years ago

Thank you for the response. I did set the argument to weights=coco to download the correct weights h5 file. I've also modified the add_class calls to include things like

self.add_class("target", 1, "vehicleA") ... self.add_class("target", 6, "vehicleF")

I've set the "NUM_CLASSES = 1 + 6" for the 6 classes plus background

And I've modified the load_mask function to account for these:

if image_info["source"] != "target":
    return super(self.__class__, self).load_mask(image_id)

and

for i, p in enumerate(class_names):
    if p['name'] == 'VehicleA':
        class_ids[i] = 1
    ....
    elif p['name'] == 'VehicleF':
        class_ids[i] = 6

And did the above also for the load_mask_hc function.
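A possible alternative to the if/elif chain is a dictionary lookup, which raises an error for any region name that has no class id instead of silently skipping it. This is only a sketch: CLASS_IDS and class_ids_for are hypothetical names, and only the two class names quoted in this comment are listed because the other four were elided.

```python
# Hypothetical mapping from the VIA "name" attribute to a class id.
# Only the two names quoted in the thread appear here; the remaining
# classes would be added in the same way.
CLASS_IDS = {"VehicleA": 1, "VehicleF": 6}


def class_ids_for(class_names):
    """Return one integer class id per region, failing loudly on unknown names."""
    ids = []
    for i, region in enumerate(class_names):
        name = region["name"]
        if name not in CLASS_IDS:
            raise KeyError("region %d has unmapped class name %r" % (i, name))
        ids.append(CLASS_IDS[name])
    return ids


print(class_ids_for([{"name": "VehicleA"}, {"name": "VehicleF"}]))
```

The lookup is case-sensitive, so "vehicleA" and "VehicleA" are treated as different names.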

But I still have this issue.

SUYEgit commented 6 years ago

Hi, I am a bit confused by your statement. Are 'VehicleA' and 'VehicleF' the names you assigned to the classes when labeling in VIA? If so, I can't see a problem in the information you provided. My colleagues have also run the code on their own datasets and we have never seen this happen.

Did you install TensorFlow and Keras following the readme?

david-schweitzer commented 6 years ago

Correct. For reference, I am using TensorFlow 1.9.0 and Keras 2.2.0. Also, here is a snippet of the JSON file to show what VIA outputs (and kind of merge the other issue I created since I think these two are related).

"file.bmp": {"fileref": "", "size": 786432, "filename": "file.bmp", "base64_img_data": "", "file_attributes": {}, "regions": {"0": {"shape_attributes": {"name": "polygon", "all_points_x": [478, 610, 478], "all_points_y": [250, 533, 250]}, "region_attributes": {"name": "VehicleA", "target": 1}}}}}

The above is actually the LAST entry in a very, very large list of entries. I'm not sure if you have each image (entry) stored as an element of a list like

{[{"file1-image" : {VIA stuff}}, {"file2-image": {VIA stuff}}, ..., {"fileN-image": {VIA stuff}}]}

Or if you have them as keys, like {"file1-image": {VIA stuff}, ...., "fileN-image": {VIA stuff}} (this is what I have and what I've given an example of)

Your example in surgery.py, I think, only shows one single file in the JSON. Maybe you didn't concatenate all 400 (?) or so images of your own training set into a single JSON, and I'm mistaken?

A lot of the full-on errors I'm getting have been fixed after making the modifications discussed in this issue. I just have two other issues at the moment, both involving training.

If I train using the argument --coco, I no longer get the error originally stated in this issue. Instead, it reaches Epoch 1/10 (I just wanted 10 to make sure it worked), and then it hangs. I ran it over the entire weekend, no progress, no update to the logs directory. It correctly creates a logs directory and correctly downloads the coco weights, but does nothing.

For a reference, I am training on over 10k images of size 1024x768 that are bitmaps. So obviously I should employ using a GPU on some supercomputer. But to do that, I need to have the weights file already downloaded since our GPU nodes cannot download/upload or communicate to a network at all. That's fine--as I said, --coco correctly downloads the h5 file. I just move this somewhere and do --""path to h5""

But if I try and set the path to this specific h5 file, I get the issue. It correctly locates the h5 weights, and then tells me the ValueError I gave above.

EDIT: Just as an update, I'm going to retry this with PRECISELY your versions of Tensorflow and Keras. You'll need to actually edit your requirements.txt; higher versions of Keras do not use the topology import that model.py uses; all instances of "topology" must instead be replaced with "saving" to fix this issue (and I'm not sure how this affects performance; is it just a name change or is different machinery running in the background?). I've also had troubles in the past with moving between Tensorflow 1.5 to something higher and certain TF algorithms on recent versions not running at all on programs built with older versions. Since I still have issues reading in the weights h5 file on a GPU node, I'll run it all day on a CPU node with 5 steps per epoch and only 200 training images and update if anything new happens.
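One way to avoid editing every "topology" reference by hand is to look up whichever module provides the HDF5 loading helper at import time. The sketch below assumes the helper lives in keras.engine.saving on newer Keras 2.x releases and in keras.engine.topology on older ones, as described above; find_hdf5_loader_module is a hypothetical name, and the function returns None when neither location is available.

```python
import importlib


def find_hdf5_loader_module():
    """Return the Keras module that exposes the by-name HDF5 weight loader.

    Tries the newer location first, then the older one. Returns None if
    Keras is not installed or neither module provides the helper.
    """
    for name in ("keras.engine.saving", "keras.engine.topology"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        if hasattr(module, "load_weights_from_hdf5_group_by_name"):
            return module
    return None


loader = find_hdf5_loader_module()
print("loader module:", getattr(loader, "__name__", None))
```

Because the same function name is looked up in both places, this only changes where the function is imported from, not which loading routine is called.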

SUYEgit commented 6 years ago

For your first question: after you label images in VIA, it can automatically export a JSON file that saves the annotations of all images. The format is like: {"Picture_1.jpg":{"fileref":"", "size":620823, "filename":"Picture 850.jpg","base64_img_data":"", "file_attributes":{}, "regions":{"0":{"shape_attributes":{"name":"polygon","all_points_x":[1129,1171,1179,1189,1187,1185,1151,1143,1133,1129], "all_points_y":[467,657,658,653,635,622,463,466,468,467]},"region_attributes":{"name":"arm"}}, "1":{"shape_attributes":......},......}, "Picture_2.jpg":...., "Picture_3.jpg":......}

I don't think you need to edit or look into the details of the JSON file. We have also tried the newest version of VIA, and the format of the JSON file stays unchanged.
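To illustrate the keyed-by-filename layout, here is a minimal sketch that walks a VIA export and pulls out each region's class name and polygon. It uses the single entry quoted earlier in this thread as sample data; iter_regions is a hypothetical helper, and the list-versus-dict handling of "regions" is an assumption about differences between VIA versions.

```python
import json

# Sample data: the one entry quoted earlier in the thread, keyed by filename.
VIA_EXPORT = """
{"file.bmp": {"fileref": "", "size": 786432, "filename": "file.bmp",
  "base64_img_data": "", "file_attributes": {},
  "regions": {"0": {"shape_attributes": {"name": "polygon",
      "all_points_x": [478, 610, 478], "all_points_y": [250, 533, 250]},
    "region_attributes": {"name": "VehicleA", "target": 1}}}}}
"""


def iter_regions(annotations):
    """Yield (filename, class_name, xs, ys) for every annotated region."""
    for entry in annotations.values():
        regions = entry["regions"]
        # Assumption: some VIA versions export regions as a list, not a dict.
        if isinstance(regions, dict):
            regions = regions.values()
        for region in regions:
            shape = region["shape_attributes"]
            yield (entry["filename"],
                   region["region_attributes"]["name"],
                   shape["all_points_x"],
                   shape["all_points_y"])


rows = list(iter_regions(json.loads(VIA_EXPORT)))
print(rows)
```

Iterating over the dictionary's values means it does not matter how many images are in the file, as long as each top-level key maps to one image entry.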

SUYEgit commented 6 years ago

For your second point: yes, someone else ran into the same Keras version problem, which comes from the original Matterport implementation. I am also unsure whether changing topology to saving would cause any performance difference. Maybe we can look for a similar case or post an issue at https://github.com/matterport/Mask_RCNN.

SUYEgit commented 6 years ago

As for training hanging on the first iteration, we tried on different computers and never ran into that issue. Did you check your CUDA installation? Are the CUDA drivers and cuDNN all correctly installed? That's the only reason I can think of for the training to hang.

SUYEgit commented 6 years ago

I am a bit confused by your EDIT: do you mean you are able to train on a small dataset with a CPU?

SUYEgit commented 6 years ago

I just found a typo in the script visualize.py, line 91, due to recent updates. The parameter "real_time" should default to "False". Does the code you are using include this update? If so, please try training again after modifying that. I have also updated it on GitHub.

david-schweitzer commented 6 years ago

So this is what I've done to try and address everything talked about.

1) I was indeed NOT using a GPU. That was my fault, but you'll also want to probably specify something in your readme or requirements: tensorflow and tensorflow-gpu are different in that one is CPU and other is GPU. I had to uninstall the CPU and install the GPU before I could get access to the GPU node (TF doesn't like both packages installed). Running "pip install tensorflow>=1.7.0" won't install the GPU TF. Fixing this to "pip install tensorflow-gpu>=1.7.0" works.

2) I've moved everything to a brand new laptop and verified that CUDA (9.0) and cuDNN (7.0.5 + three patch updates) are installed and the correct variables are set in the environment. surgery.py now prints out all the GPU information, so it correctly located the GPU node and is applying it.

3) I re-cloned your github with all the changes you've made and changed nothing except number of steps/epoch (down to 10) and number of epochs (down to 5) just to make sure I could get SOME results. This means I'm only looking for 2 classes: "arm" and "ring" (I just told all my annotations to call my targets "arm" or "ring" and made sure I was consistent).

4) surgery.py correctly downloads the mask_rcnn_coco.h5, finds the GPU, prints out all the backend information, finds the dataset (200 training, 10 validation).
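Regarding point 1, a quick way to see which TensorFlow distributions are installed in the current environment is to query the package metadata. This is a sketch using only the standard library (Python 3.8 or newer); installed_tf_packages is a hypothetical helper name.

```python
from importlib import metadata


def installed_tf_packages():
    """Return {distribution name: version} for the TensorFlow packages found."""
    found = {}
    for dist in ("tensorflow", "tensorflow-gpu"):
        try:
            found[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            pass  # this distribution is not installed
    return found


packages = installed_tf_packages()
if len(packages) > 1:
    print("Both CPU and GPU packages are installed:", packages)
else:
    print("TensorFlow packages found:", packages)
```

If both names show up, that matches the conflicting-install situation described in point 1.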

Still hanging on Epoch 1/5. It definitely sounds like it's running. At this point I'm not too sure anymore. Running nvidia-smi.exe tells me that Python is definitely using my GPU node. Specifying weights=/mask_rcnn_coco.h5 is still giving me a ValueError:

  File "surgery.py", line 520, in <module>
    model.load_weights(weights_path, by_name=True)
  File "\mrcnn\model.py", line 2101, in load_weights
    topology.load_weights_from_hdf5_group_by_name(f, layers)
  File "C:\Anaconda3\lib\site-packages\keras\engine\topology.py", line 3468, in load_weights_from_hdf5_group_by_name
    K.batch_set_value(weight_value_tuples)
  File "C:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 2368, in batch_set_value
    assign_op = x.assign(assign_placeholder)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 609, in assign
    return state_ops.assign(self._variable, value, use_locking=use_locking)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\state_ops.py", line 281, in assign
    validate_shape=validate_shape)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 64, in assign
    use_locking=use_locking, name=name)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3292, in create_op
    compute_device=compute_device)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3332, in _create_op_helper
    set_shapes_for_outputs(op)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2496, in set_shapes_for_outputs
    return _set_shapes_for_outputs(op)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2469, in _set_shapes_for_outputs
    shapes = shape_func(op)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2399, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Dimension 1 in both shapes must be equal, but are 12 and 324. Shapes are [1024,12] and [1024,324]. for 'Assign_682' (op: 'Assign') with input shapes: [1024,12], [1024,324].

I'm PRETTY sure this error and the "hanging on Epoch 1/5" are actually the same; the weights don't seem to be loading correctly.
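The traceback shows load_weights being called with only by_name=True, so every layer in the COCO file is assigned, including the class-dependent head layers whose shapes depend on the number of classes. In the upstream Matterport code, load_weights also accepts an exclude list of layer names, and its sample scripts skip the four head layers when starting from COCO weights. The sketch below builds the keyword arguments for such a call; it assumes this repository's copy of model.py supports the same exclude parameter, and load_weights_kwargs is a hypothetical helper.

```python
import os

# Head layers whose shapes depend on NUM_CLASSES (names as used upstream).
HEAD_LAYERS = ["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"]


def load_weights_kwargs(weights_path, num_classes, coco_num_classes=81):
    """Build keyword arguments for model.load_weights.

    When the file is the COCO checkpoint and the model has a different
    number of classes, the class-dependent head layers are excluded so
    that they keep their freshly initialised weights.
    """
    kwargs = {"by_name": True}
    is_coco_file = os.path.basename(weights_path).lower() == "mask_rcnn_coco.h5"
    if is_coco_file and num_classes != coco_num_classes:
        kwargs["exclude"] = list(HEAD_LAYERS)
    return kwargs


print(load_weights_kwargs("weights/mask_rcnn_coco.h5", num_classes=3))
# Intended use (assumption, not run here):
# model.load_weights(weights_path, **load_weights_kwargs(weights_path, config.NUM_CLASSES))
```

This would also be consistent with the error disappearing when weights=coco is passed but returning when the same file is given by path, if the script only excludes the head layers in the former case.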

uday1994 commented 6 years ago

ValueError: Dimension 1 in both shapes must be equal, but are 12 and 324. Shapes are [1024,12] and [1024,324]. for 'Assign_682' (op: 'Assign') with input shapes: [1024,12], [1024,324]...

I'm getting the same error. What should I do?

umair-sabir commented 6 years ago

Hi,

I am getting this error when I start training. Is there something wrong with my JSON file, or do I need to change the code? I have two classes in my picture labelling: one is "yes" and the other is "normal". I just replaced your dataset with mine and changed the "arm" and "ring" names to "yes" and "normal", but I am still getting this error.

mrcnn_mask (TimeDistributed)
/home/hdfsf16/.conda/envs/mask/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/hdfsf16/.conda/envs/mask/lib/python3.5/site-packages/keras/engine/training.py:1987: UserWarning: Using a generator with use_multiprocessing=True and multiple workers may duplicate your data. Please consider using the keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
ERROR:root:Error processing image {'width': 1280, 'id': '15691.png', 'names': [{'type': 'yes'}], 'source': 'type', 'height': 1024, 'polygons': [{'all_points_y': [647, 648, 657, 657, 669, 666, 657, 652, 640, 615, 619, 641, 647], 'name': 'polygon', 'all_points_x': [661, 676, 686, 700, 695, 658, 651, 651, 653, 647, 658, 660, 661]}], 'path': 'data/surgery/train/15691.png'}
Traceback (most recent call last):
  File "/home/hdfsf16/.conda/envs/mask/Surgery-Robot-Detection-Segmentation-master/Surgery-Robot-Detection-Segmentation-master/mrcnn/model.py", line 1696, in data_generator
    use_mini_mask=config.USE_MINI_MASK)
  File "/home/hdfsf16/.conda/envs/mask/Surgery-Robot-Detection-Segmentation-master/Surgery-Robot-Detection-Segmentation-master/mrcnn/model.py", line 1210, in load_image_gt
    mask, class_ids = dataset.load_mask(image_id)
  File "surgery.py", line 164, in load_mask
    class_names = info["type"]
KeyError: 'type'
ERROR:root:Error processing image {'width': 1280, 'id': '2103.png', 'names': [{'type': 'yes'}], 'source': 'type', 'height': 1024, 'polygons': [{'all_points_y': [661, 655, 651, 645, 639, 648, 657, 661], 'name': 'polygon', 'all_points_x': [657, 669, 685, 669, 650, 646, 646, 657]}], 'path': 'data/surgery/train/2103.png'}
Traceback (most recent call last):
  File "/home/hdfsf16/.conda/envs/mask/Surgery-Robot-Detection-Segmentation-master/Surgery-Robot-Detection-Segmentation-master/mrcnn/model.py", line 1696, in data_generator
    use_mini_mask=config.USE_MINI_MASK)
  File "/home/hdfsf16/.conda/envs/mask/Surgery-Robot-Detection-Segmentation-master/Surgery-Robot-Detection-Segmentation-master/mrcnn/model.py", line 1210, in load_image_gt
    mask, class_ids = dataset.load_mask(image_id)
  File "surgery.py", line 164, in load_mask
    class_names = info["type"]
KeyError: 'type'

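The KeyError in this log is a separate problem from the weight-shape error. The image info dictionary printed in the log has the keys 'names', 'source', 'polygons' and so on, but no top-level 'type' key; 'type' appears only inside each entry of 'names'. The sketch below reads the class labels from that structure, using a trimmed copy of the logged dictionary. The class id assignment (yes = 1, normal = 2) and the helper name are assumptions made for illustration.

```python
# Trimmed copy of the image info dictionary shown in the log above.
info = {
    "id": "15691.png",
    "source": "type",
    "names": [{"type": "yes"}],
}

# Assumed class id assignment; adjust to match the add_class calls in use.
CLASS_IDS = {"yes": 1, "normal": 2}


def class_ids_from_info(info, attribute="type"):
    """Read one class id per region from info['names'].

    The per-region attribute (here 'type') lives inside each entry of
    info['names'], not at the top level of the info dictionary.
    """
    return [CLASS_IDS[entry[attribute]] for entry in info["names"]]


print(class_ids_from_info(info))
```

In other words, renaming the list key when the dataset is loaded and the key that load_mask reads need to be kept in sync.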
eyildiz-ugoe commented 5 years ago

I get this error on prediction as well...

SUYEgit / Surgery-Robot-Detection-Segmentation

ValueError: saved weight shape issue #3