Open ambujpd opened 7 years ago
When was the last time you rebased? I remember @akyrola fixed some bug that sounds similar.
Which commit fixed this? I've been seeing a similar intermittent bug. My theory was that a GradientWrapper occasionally ends up where an Operator is expected.
@salexspb The problem persists with the latest commit as well. I've tried just now.
I have never seen this bug. It looks to me like some Python2/3 compatibility issue involving string/bytes/Unicode. Ambuj, can you tell me about your setup (mac/linux, python version etc.)?
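For anyone who wants to check the string/bytes theory on their own machine, here is a minimal sketch in plain Python with no Caffe2 calls. The helper name `to_bytes` is hypothetical and not part of Caffe2. It only illustrates normalizing text to a byte string before it crosses into a C++ binding, on the assumption that the binding expects bytes.

```python
import sys


def to_bytes(value, encoding="utf-8"):
    """Return `value` as a byte string (hypothetical helper, not a Caffe2 API)."""
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias of `str`, so plain strings land here.
        return value
    # On Python 3 `str` is Unicode text; on Python 2 that role is played by `unicode`.
    text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821
    if isinstance(value, text_type):
        return value.encode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(value))


# Show the interpreter version next to the type the helper produces.
print(sys.version_info[:3], type(to_bytes(u"AveragedLoss")))
```

If the printed type differs between two environments for the same input, that would point toward the compatibility theory rather than a missing gradient.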
I'm using Linux (Ubuntu 14.04) and Python 2.6.
This was working for me 3-4 weeks ago. I don't know after which commit this issue started.
We recently made a bunch of changes for Python 3 compatibility. I would not be surprised if that broke Python 2.6. Try a newer Python?
When I ran the toy regression (see: https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Toy_Regression.ipynb), this bug happened. My caffe2 package is built on Windows 10 x64 with Python 2.7/3.5/3.6, and the bug exists on all three versions. But I had not seen this bug in the weeks before.
@talk2ap001 Python 2.6 was end-of-lifed in 2013 (2.7 is on its way out already), so please try with 2.7 or 3.5/3.6.
@Giszy I have not been able to reproduce this error with the Toy Regression. Could you export the HTML of your run and attach it to this issue?
@talk2ap001 Could you give us some example code that reproduces this issue?
Sorry, I meant to say 2.7.6.
I can only try it out with Python 3 tomorrow. I didn't follow the commits and understood that only Python 2.7 is supported, as mentioned in the docs.
Good. @tomdz Created init net. //--------------------------------------------------------------------------- Exception Traceback (most recent call last)
Tests which (sometimes) exhibit this behavior:
caffe2/python/optimizer_test.py::TestSgd::testSparse
Exception: No gradient registered for AveragedLoss. Exception from creating the gradient op: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'is_empty'.
caffe2/python/python_op_test.py::PythonOpTest::test_gradient
E0531 21:27:41.782681 337 pybind_state.cc:190] Exception encountered running PythonOp function: <type 'exceptions.AttributeError'>: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'shape'
@lukeyeager I've run both of these in a loop without any issues. Could you give me your environment specs (OS, Python version, C++ compiler) and the command line you were using to run these tests?
@tomdz I never can reproduce the failures locally either (otherwise I would have submitted a fix). These are all in our automated build+test environments. I shared them in hopes that the error messages would prompt one of the C2 devs to figure out the fix intuitively.
@Giszy Sorry, that is very hard to read. Did you drag and drop the HTML file that you got from File -> Download as -> HTML in the IPython notebook into the input box?
@lukeyeager Sure, but can you get me the env spec and command line used in your CI env?
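One way to collect the basic env spec asked for here is a short script that uses only the Python standard library. This is a sketch, not an official Caffe2 tool. The function name `environment_report` is made up for illustration, and it does not cover the C++ compiler or the pybind11 version, which would need to be reported separately.

```python
import platform
import sys


def environment_report():
    """Collect basic OS and Python details as a dict of strings."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "executable": sys.executable,
    }


# Print one "key: value" line per entry so it can be pasted into the issue.
for key, value in sorted(environment_report().items()):
    print("%s: %s" % (key, value))
```

Running it with the same interpreter that runs the tests matters, since a CI image can have several Python installations.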
The command line is pretty simple, just pytest -v $FILENAME.
Environment is not simple (can share more details elsewhere). But it's not horribly broken or anything. All the Ctests and most of the Python tests are regularly passing.
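Since the failures are intermittent, a single run says little. Below is a rough sketch of repeating one test file and counting the failing runs. The function name `count_failures` and the default of 20 runs are assumptions for illustration. The only pytest behavior it relies on is that pytest exits with a non-zero code when a test fails.

```python
import subprocess
import sys


def count_failures(command, runs=20):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # subprocess.call() returns the exit code of the finished process.
        if subprocess.call(command) != 0:
            failures += 1
    return failures


# Example usage (path taken from the failing tests listed above):
#   cmd = [sys.executable, "-m", "pytest", "-v", "caffe2/python/optimizer_test.py"]
#   print("%d failing runs" % count_failures(cmd))
```

Using `sys.executable -m pytest` keeps the loop on the same interpreter as the script itself, which avoids mixing Python versions by accident.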
When I run "python muti_gpu_train.py", this bug also occurs.
Are you sure this bug has been fixed?
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Data folder found at /home/ubuntu/caffe2_notebooks/tutorial_data/resnet_trainer
Traceback (most recent call last):
File "muti_gpu_train.py", line 177, in
Invoked with: '\n\x18imonaboat/comp_15_conv_3\n\x1cimonaboat/comp_15_spatbn_3_s\n\x1cimonaboat/comp_15_spatbn_3_b\n\x1dimonaboat/comp_15_spatbn_3_rm\n\x1eimonaboat/comp_15_spatbn_3_riv\x12\x1aimonaboat/comp_15_spatbn_3\x12\x1dimonaboat/comp_15_spatbn_3_rm\x12\x1eimonaboat/comp_15_spatbn_3_riv\x12\x1dimonaboat/comp_15_spatbn_3_sm\x12\x1eimonaboat/comp_15_spatbn_3_siv\x1a\x00"\tSpatialBN\x0e\n\x07epsilon\x15o\x12\x83:\x1b\n\x17cudnn_exhaustive_search\x18\x00\x0b\n\x07is_test\x18\x00\r\n\tuse_cudnn\x18\x01\r\n\x05order"\x04NCHW\x0f\n\x08momentum\x15\xcd\xcc\xcc=2\x04\x08\x01\x10\x00', [<caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b67e0>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6810>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6840>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b6870>, <caffe2.python.caffe2_pybind11_state_gpu.GradientWrapper object at 0x7f873d7b68a0>].
I wish this bug could be fixed at once.
Another slight variation (testing on 02ec758e154e66def63d90700d88f234446bc599):
caffe2/python/python_op_test.py::PythonOpTest::test_gradient_multiple_with_indicies
E0608 15:48:40.626631 363 pybind_state.cc:190] Exception encountered running PythonOp function: <type 'exceptions.AttributeError'>: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'reshape'
Hey @tomdz , let me know if you have any questions about this Caffe2 stuff, thank you for looking into this!
So I was looking into this some more and don't really see a reason why this would fail for you but not for me locally, other than changes in pybind11. What pybind11 version are you using?
I'm just using the version in third_party/.
Can you try again with https://github.com/caffe2/caffe2/commit/2e2d00f161f874ccbdea1ac209352e31a86e133e ?
@tomdz, my problem has been resolved. I think it was indeed caused by a unicode string.
@tomdz This resolves the issue for me as well. :+1:
Tested before your change (2f49ef37): 7/7 passed
Tested after your change (2e2d00f1): 5/7 passed
Presumably the first pipeline succeeded by luck, since the failures are intermittent. The jobs which failed still had the same error:
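As a rough sanity check on the "by luck" reading, the arithmetic can be sketched as follows. It rests on two assumptions that the thread does not confirm: that the jobs fail independently, and that the 2-in-7 failure rate seen in the second pipeline also applied to the first.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass."""
    return (1.0 - failure_rate) ** runs


# Assumption: use the second pipeline's 2 failures out of 7 as the per-job failure rate.
estimated_failure_rate = 2.0 / 7.0
print(round(prob_all_pass(estimated_failure_rate, 7), 3))
```

Under those assumptions a clean 7/7 run comes out at a bit under 10 percent, so it is unlikely but not implausible. With only seven jobs per pipeline the estimate is very coarse.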
Exception: No gradient registered for AveragedLoss. Exception from creating the gradient op: 'caffe2.python.caffe2_pybind11_state_gpu.Workspace' object has no attribute 'is_empty'.
Your change may still be helpful, but it didn't seem to fix the issue we're seeing.
@lukeyeager I think your issue is a different one. Could you file a separate task for it ?
Sure!
Made a new issue at https://github.com/caffe2/caffe2/issues/793.
In the latest code, I'm getting the following error:
@akyrola @salexspb