'No gradient registered' exception while running Resnet50 #700

ambujpd commented 7 years ago

In the latest code, I'm getting the following error:

INFO:data_parallel_model:Model for GPU: 0
INFO:data_parallel_model:Adding gradient operators
Traceback (most recent call last):
  File "resnet50_trainer.py", line 454, in <module>
    main()
  File "resnet50_trainer.py", line 450, in main
    Train(args)
  File "resnet50_trainer.py", line 305, in Train
    optimize_gradient_memory=True,
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/data_parallel_model.py", line 140, in Parallelize_GPU
    _AddGradientOperators(devices, model_helper_obj, losses_by_gpu)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/data_parallel_model.py", line 431, in _AddGradientOperators
    model.AddGradientOperators(loss_grad)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/model_helper.py", line 229, in AddGradientOperators
    self.grad_map = self.net.AddGradientOperators(*args, **kwargs)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/core.py", line 1626, in AddGradientOperators
    self._net.op[skip:], ys)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/core.py", line 1011, in GetBackwardPass
    return ir.GetBackwardPass(ys)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/core.py", line 891, in GetBackwardPass
    forward_op_idx, all_input_to_grad)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/core.py", line 841, in _GenerateGradientsForForwardOp
    forward_op, g_output)
  File "/nfs/AMBUJ/Caffe2_8/caffe2-master/build/caffe2/python/core.py", line 985, in GetGradientForOp
    "Exception from creating the gradient op: {}.".format(e))
Exception: No gradient registered for Scale. Exception from creating the gradient op: get_gradient_defs(): incompatible function arguments. The following argument types are supported:
    1. (arg0: unicode, arg1: List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]) -> Tuple[List[str], List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]]

Invoked with: '\n\ngpu_0/loss\x12\x14gpu_0/resnet50/Scale\x1a\x00"\x05Scale*\x0c\n\x05scale\x15\x00\x00\x80?2\x04\x08\x01\x10\x00', [<caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f76a066f750>].

@akyrola @salexspb

salexspb commented 7 years ago

When was the last time you rebased? I remember @akyrola fixed some bug that sounds similar.

lukeyeager commented 7 years ago

Which commit fixed this? I've been seeing a similar intermittent bug. My theory was that a GradientWrapper occasionally ends up where an Operator is expected.

ambujpd commented 7 years ago

@salexspb The problem persists with the latest commit as well. I've tried just now.

akyrola commented 7 years ago

I have never seen this bug. It looks to me like some Python2/3 compatibility issue involving string/bytes/Unicode. Ambuj, can you tell me about your setup (mac/linux, python version etc.)?
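
To illustrate the string/bytes distinction: the traceback shows the binding expecting `unicode` while being invoked with a plain byte string. A minimal sketch, using the serialized payload copied from the "Invoked with:" line of the first traceback, shows that such a payload is binary data rather than text. The helper name `is_valid_utf8` is made up for this example, and reading the bytes after `\x15` as a 32-bit float is an assumption based on the protobuf wire format, not something stated in this thread.

```python
import struct

# Serialized operator copied from the "Invoked with:" line of the first traceback.
payload = (
    b'\n\ngpu_0/loss\x12\x14gpu_0/resnet50/Scale\x1a\x00"\x05Scale'
    b'*\x0c\n\x05scale\x15\x00\x00\x80?2\x04\x08\x01\x10\x00'
)


def is_valid_utf8(data):
    """Return True if the byte string can be decoded as UTF-8 text (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the four bytes following the \x15 tag are a little-endian float32
# holding the "scale" argument of the Scale operator.
offset = payload.index(b"\x15") + 1
scale = struct.unpack("<f", payload[offset:offset + 4])[0]

print(is_valid_utf8(payload))  # the lone \x80 byte makes this binary, not text
print(scale)
```

Because the payload cannot be treated as text, an interface that accepts only a text type may reject it, which is consistent with the "incompatible function arguments" message above.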


ambujpd commented 7 years ago

I'm using Linux (Ubuntu 14.04) and Python 2.6.

This was working for me 3-4 weeks ago; I don't know after which commit this issue started.

akyrola commented 7 years ago

We recently made a bunch of changes for Python 3 compatibility. I would not be surprised if that broke Python 2.6. Could you try a newer Python?


Giszy commented 7 years ago

When I ran the Toy Regression tutorial (see: https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Toy_Regression.ipynb), this bug happened. My caffe2 package is built on Windows 10 x64 with Python 2.7/3.5/3.6, and the bug exists on all three versions. I had not seen this bug a few weeks ago.

tomdz commented 7 years ago

@talk2ap001 Python 2.6 reached end of life in 2013 (and 2.7 is already on its way out), so please try with 2.7 or 3.5/3.6.

tomdz commented 7 years ago

@Giszy I have not been able to reproduce this error with the Toy Regression. Could you export the HTML of your run and attach it to this issue?

tomdz commented 7 years ago

@talk2ap001 Could you give us some example code that reproduces this issue?

ambujpd commented 7 years ago

Sorry, I meant to say 2.7.6.

I can only try it out with Python 3 tomorrow. I didn't follow the commits and was under the impression that only Python 2.7 is supported, as mentioned in the docs.

Giszy commented 7 years ago

Good. @tomdz

Created init net.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
in ()
     44
     45 # Get gradients for all the computations above.
---> 46 gradient_map = train_net.AddGradientOperators([loss])
     47 graph = net_drawer.GetPydotGraph(train_net.Proto().op, "train", rankdir="LR")
     48 display.Image(graph.create_png(), width=800)

c:\python27_x64\lib\site-packages\caffe2\python\core.pyc in AddGradientOperators(self, ys, skip)
   1624
   1625 grad_ops, input_to_grad = GradientRegistry.GetBackwardPass(
-> 1626     self._net.op[skip:], ys)
   1627 # Check if in immediate mode: the grad_ops are actually being produced
   1628 # by C++ and bypasses the CreateOperator() call, so in immediate mode

c:\python27_x64\lib\site-packages\caffe2\python\core.pyc in GetBackwardPass(cls, operators, ys, ys_generate_gradient)
   1009 """
   1010 ir = IR(operators)
-> 1011 return ir.GetBackwardPass(ys)
   1012
   1013

c:\python27_x64\lib\site-packages\caffe2\python\core.pyc in GetBackwardPass(self, ys)
    889 for forward_op_idx in reversed(range(len(self.ssa))):
    890     input_to_grad, gradient_ops = self._GenerateGradientsForForwardOp(
--> 891         forward_op_idx, all_input_to_grad)
    892     all_input_to_grad.update(input_to_grad)
    893     all_gradient_ops += gradient_ops

c:\python27_x64\lib\site-packages\caffe2\python\core.pyc in _GenerateGradientsForForwardOp(self, forward_op_idx, input_to_grad)
    839 if not all(g is None for g in g_output):
    840     gradient_ops, g_input = GradientRegistry.GetGradientForOp(
--> 841         forward_op, g_output)
    842     # Check if the gradient operators are legal, and update
    843     # gradient_generators and gradient_frontier

c:\python27_x64\lib\site-packages\caffe2\python\core.pyc in GetGradientForOp(cls, op, g_output)
    983 raise Exception(
    984     "No gradient registered for {}. ".format(op.type) +
--> 985     "Exception from creating the gradient op: {}.".format(e))
    986
    987 if gradient_ops is None:

Exception: No gradient registered for GaussianFill. Exception from creating the gradient op: get_gradient_defs(): incompatible function arguments. The following argument types are supported:
    1. (arg0: unicode, arg1: List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]) -> Tuple[List[str], List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]]

Invoked with: '\x12\x01X\x1a\x00"\x0cGaussianFill*\n\n\x03std\x15\x00\x00\x80?*\x0c\n\x08run_once\x18\x00*\x0b\n\x05shape0@0\x02*\x0b\n\x04mean\x15\x00\x00\x00\x00', [].

lukeyeager commented 7 years ago

Tests which (sometimes) exhibit this behavior:

caffe2/python/optimizer_test.py::TestSgd::testSparse

Exception: No gradient registered for AveragedLoss. Exception from creating the gradient op: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'is_empty'.

caffe2/python/python_op_test.py::PythonOpTest::test_gradient

E0531 21:27:41.782681 337 pybind_state.cc:190] Exception encountered running PythonOp function: <type 'exceptions.AttributeError'>: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'shape'

tomdz commented 7 years ago

@lukeyeager I've run both of these in a loop without any issues. Could you give me your environment specs (OS, Python version, C++ compiler) and the command line you were using to run these tests?
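
For anyone gathering the requested environment details, a small sketch using only the standard library is shown here. The function name `environment_report` and the chosen keys are illustrative, and `platform.python_compiler()` reports the compiler that built the Python interpreter, which is not necessarily the one used to build Caffe2.

```python
import platform
import sys


def environment_report():
    """Collect basic OS and interpreter details for a bug report (illustrative helper)."""
    return {
        "os": platform.platform(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        # Compiler that built the Python interpreter, not necessarily Caffe2.
        "python_compiler": platform.python_compiler(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("{}: {}".format(key, value))
```

The C++ compiler used for the Caffe2 build itself would still need to be reported separately, for example from the build logs.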

lukeyeager commented 7 years ago

@tomdz I can never reproduce the failures locally either (otherwise I would have submitted a fix). These all occur in our automated build+test environments. I shared them in the hope that the error messages would prompt one of the C2 devs to figure out the fix intuitively.

tomdz commented 7 years ago

@Giszy Sorry, that is very hard to read. Did you drag and drop the HTML file that you got from File -> Download as -> HTML in the IPython notebook into the input box?

tomdz commented 7 years ago

@lukeyeager Sure, but can you get me the environment spec and command line used in your CI environment?

lukeyeager commented 7 years ago

Command-line is pretty simple, just pytest -v $FILENAME.

Environment is not simple (can share more details elsewhere). But it's not horribly broken or anything. All the Ctests and most of the Python tests are regularly passing.
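
Since the failures are intermittent, one way to look for them is to repeat the same command many times and count the non-zero exits. The sketch below is one possible approach, not the setup used by anyone in this thread; `count_failures` is a hypothetical helper, and `subprocess.run` requires Python 3.5 or newer.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example call (not executed here); the test path is the one quoted above:
# count_failures(
#     [sys.executable, "-m", "pytest", "-v", "caffe2/python/python_op_test.py"],
#     runs=20,
# )
```

A loop like this only helps if the failure is reproducible in the environment where it runs, which, per the comments above, was not the case on local machines.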

Giszy commented 7 years ago

OK. @tomdz

G.zip

hset911 commented 7 years ago

When I run "python muti_gpu_train.py", this bug also occurs. Are you sure this bug is fixed?

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Data folder found at /home/ubuntu/caffe2_notebooks/tutorial_data/resnet_trainer
Traceback (most recent call last):
  File "muti_gpu_train.py", line 177, in
    blobs_to_gradients = train_model.AddGradientOperators(losses)
  File "/usr/local/caffe2/python/model_helper.py", line 251, in AddGradientOperators
    self.grad_map = self.net.AddGradientOperators(*args, **kwargs)
  File "/usr/local/caffe2/python/core.py", line 1626, in AddGradientOperators
    self._net.op[skip:], ys)
  File "/usr/local/caffe2/python/core.py", line 1011, in GetBackwardPass
    return ir.GetBackwardPass(ys)
  File "/usr/local/caffe2/python/core.py", line 891, in GetBackwardPass
    forward_op_idx, all_input_to_grad)
  File "/usr/local/caffe2/python/core.py", line 841, in _GenerateGradientsForForwardOp
    forward_op, g_output)
  File "/usr/local/caffe2/python/core.py", line 985, in GetGradientForOp
    "Exception from creating the gradient op: {}.".format(e))
Exception: No gradient registered for SpatialBN. Exception from creating the gradient op: get_gradient_defs(): incompatible function arguments. The following argument types are supported:
    (arg0: unicode, arg1: List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]) -> Tuple[List[str], List[caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper]]

Invoked with: '\n\x18imonaboat/comp_15_conv_3\n\x1cimonaboat/comp_15_spatbn_3_s\n\x1cimonaboat/comp_15_spatbn_3_b\n\x1dimonaboat/comp_15_spatbn_3_rm\n\x1eimonaboat/comp_15_spatbn_3_riv\x12\x1aimonaboat/comp_15_spatbn_3\x12\x1dimonaboat/comp_15_spatbn_3_rm\x12\x1eimonaboat/comp_15_spatbn_3_riv\x12\x1dimonaboat/comp_15_spatbn_3_sm\x12\x1eimonaboat/comp_15_spatbn_3_siv\x1a\x00"\tSpatialBN\x0e\n\x07epsilon\x15o\x12\x83:\x1b\n\x17cudnn_exhaustive_search\x18\x00\x0b\n\x07is_test\x18\x00\r\n\tuse_cudnn\x18\x01\r\n\x05order"\x04NCHW\x0f\n\x08momentum\x15\xcd\xcc\xcc=2\x04\x08\x01\x10\x00', [<caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b67e0>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6810>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6840>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6870>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b68a0>].

I hope this bug can be fixed soon.

lukeyeager commented 7 years ago

Another slight variation (testing on 02ec758e154e66def63d90700d88f234446bc599):

caffe2/python/python_op_test.py::PythonOpTest::test_gradient_multiple_with_indicies

E0608 15:48:40.626631 363 pybind_state.cc:190] Exception encountered running PythonOp function: <type 'exceptions.AttributeError'>: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'reshape'

salexspb commented 7 years ago

Hey @tomdz , let me know if you have any questions about this Caffe2 stuff, thank you for looking into this!

tomdz commented 7 years ago

I looked into this some more and don't really see a reason why this would fail for you but not for me locally, other than changes in pybind11. What pybind11 version are you using?

lukeyeager commented 7 years ago

I'm just using the version in third_party/.

tomdz commented 7 years ago

Can you try again with https://github.com/caffe2/caffe2/commit/2e2d00f161f874ccbdea1ac209352e31a86e133e ?

Giszy commented 7 years ago

@tomdz, my problem has been resolved. I think it was indeed caused by a unicode string.

ambujpd commented 7 years ago

@tomdz This resolves the issue for me as well. :+1:

lukeyeager commented 7 years ago

Tested before your change (2f49ef37): 7/7 passed
Tested after your change (2e2d00f1): 5/7 passed

Presumably the first pipeline succeeded by luck, since the failures are intermittent. The jobs which failed still had the same error: Exception: No gradient registered for AveragedLoss. Exception from creating the gradient op: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'is_empty'.
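
As a rough check on the "by luck" reading, the chance of a clean 7/7 run can be estimated under two loud assumptions that are not established in this thread: that each job fails independently, and that the per-job failure rate equals the 2 out of 7 observed in the second run. The function name `probability_all_pass` is made up for this sketch.

```python
def probability_all_pass(per_job_failure_rate, jobs):
    """Probability that every job passes, assuming independent failures."""
    return (1.0 - per_job_failure_rate) ** jobs


# Assumption: failure rate estimated from the single 5/7 run reported above.
estimated_rate = 2.0 / 7.0
chance = probability_all_pass(estimated_rate, 7)

print("Chance of 7/7 passing: {:.1%}".format(chance))
```

Under those assumptions the chance comes out a little under 10%, so one clean pipeline followed by one with two failures is plausible for the same underlying bug. A single run of seven jobs is a very small sample, so this is only an order-of-magnitude estimate.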

Your change may still be helpful, but it didn't seem to fix the issue we're seeing.

tomdz commented 7 years ago

@lukeyeager I think your issue is a different one. Could you file a separate task for it?

lukeyeager commented 7 years ago

Sure!

lukeyeager commented 7 years ago

Made a new issue at https://github.com/caffe2/caffe2/issues/793.
