Open xxw345 opened 8 years ago
Greetings! Did you find a fix to your issue?
Yes, although I never figured out the root cause. The way I fixed it was to copy all the training data from the one machine to the other and redo the identical training on the machine I want to use for large throughput. I don't know why, but the problem looks like it comes from the data type conversion in the batch normalization layer of the NVIDIA branch.
Thanks. I was hoping to avoid doing that, but it looks like I don't have that option.
I am facing the same problem and trying to find a solution.
I have two machines running Caffe. One is a single machine with 3 GPUs (K20), which I use for training and fine-tuning models. The other is a GPU cluster used for large throughput.
I recently trained a model and tested it on the K20 machine, where it works great.
But when I copied the trained model to the GPU cluster, something strange happened: it gave an error about a size mismatch between the net definition in the prototxt and the trained weights.
This problem should probably be reported to NVIDIA's caffe branch, but that repository is not as active as BVLC's, so maybe someone here can provide some insight.
Is there a method in the Python wrapper that can print details about a trained model, like the shape and size of the weights? That would give me a way to check where the sizes differ between my two machines.
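In case it helps, here is a minimal pycaffe sketch for inspecting the learned weights; the file names `deploy.prototxt` and `model.caffemodel` are placeholders for your own files. It loads the net and prints the shape of every learnable blob per layer, so you can run it on both machines and diff the output:

```python
import caffe

# Load the net definition together with the trained weights
# (replace the paths with your own prototxt / caffemodel).
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

# net.params maps each layer name to its list of learnable blobs
# (typically weights and bias). Printing their shapes layer by layer
# makes it easy to spot where the two machines disagree.
for layer_name, blobs in net.params.items():
    shapes = [tuple(b.data.shape) for b in blobs]
    print(layer_name, shapes)
```

Running this on the K20 machine and on the cluster and comparing the two listings should show exactly which layer's blob sizes changed.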