MehmetAygun / fusenet-pytorch

Cuda related error #14

Closed zeng32 closed 4 years ago

zeng32 commented 4 years ago

Hi, when I first ran the code:

    python train.py --dataroot datasets/sunrgbd --dataset sunrgbd --name sunrgbd

I got the following error and got stuck. Enlighten me, please.

    ----------------- Options ---------------
    batch_size: 4
    ......
    initialize network with pretrained
    initialize network with kaiming
    model [FuseNetModel] was created
    ---------- Networks initialized -------------
    [Network FuseNet] Total number of parameters : 44.187 M

    create web directory ./checkpoints/sunrgbd/web...
    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
    Exception ignored in: <function _DataLoaderIter.__del__ at 0x7f7348f6d3b0>
    Traceback (most recent call last):
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
        self._shutdown_workers()
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
        self.worker_result_queue.get()
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/queues.py", line 354, in get
        return _ForkingPickler.loads(res)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
        fd = df.detach()
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
        with _resource_sharer.get_connection(self._id) as conn:
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
        c = Client(address, authkey=process.current_process().authkey)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 499, in Client
        deliver_challenge(c, authkey)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 729, in deliver_challenge
        response = connection.recv_bytes(256)  # reject large message
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
        buf = self._recv_bytes(maxlength)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
        chunk = read(handle, remaining)
    ConnectionResetError: [Errno 104] Connection reset by peer
    Traceback (most recent call last):
      File "train.py", line 64, in <module>
        model.optimize_parameters()
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 63, in optimize_parameters
        self.forward()
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 53, in forward
        self.output = self.netFuseNet(self.rgb_image,self.depth_image)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/networks.py", line 216, in forward
        x_1 = self.CBR1_DEPTH_ENC(depth_inputs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 91, in forward
        input = module(input)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

My environment: Python 3.7, torch 0.4.1.post2, pytorch 1.4.0, CUDA 10.1

hazirbas commented 4 years ago

It could be a CUDA/PyTorch version problem. Can you try with CUDA 10?

zeng32 commented 4 years ago

Thanks for the kind reply, but the error persists:

    ---------- Networks initialized -------------
    [Network FuseNet] Total number of parameters : 44.187 M

    create web directory ./checkpoints/sunrgbd/web...
    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
    Traceback (most recent call last):
      File "train.py", line 64, in <module>
        model.optimize_parameters()
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 63, in optimize_parameters
        self.forward()
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 53, in forward
        self.output = self.netFuseNet(self.rgb_image,self.depth_image)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/networks.py", line 216, in forward
        x_1 = self.CBR1_DEPTH_ENC(depth_inputs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 91, in forward
        input = module(input)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

My environment now: Python 3.7.6, torch 0.4.1.post2, CUDA 10.0

MehmetAygun commented 4 years ago

I couldn't replicate the error with Python 3.7.0, CUDA 10, and torch 0.4.1.post2. People have reported similar problems with other code bases. As @hazirbas said, it is probably related to PyTorch/CUDA; maybe you can try reinstalling PyTorch.
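
When a PyTorch/CUDA mismatch is suspected, as it is above, it can help to print what the installed torch build itself reports rather than only the system CUDA version. The following is a minimal sketch, not part of this repository; `report_versions` is a hypothetical helper name. It relies on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it still runs (reporting `torch: None`) if torch cannot be imported.

```python
import platform


def report_versions():
    """Collect Python, PyTorch and CUDA version information into a dict."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # torch is not importable in this environment
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version the torch binary was built against (None for CPU-only builds)
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in report_versions().items():
    print(f"{key}: {value}")
```

If the CUDA version reported by torch differs from the one installed on the system, or `cuda_available` is false, that is worth including in the report alongside the error.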

zeng32 commented 4 years ago

Thanks for the advice. After reinstalling everything, including Ubuntu 16.04, the CUDA problem is gone, but the following error appears:

    ----------------- Options ---------------
    batch_size: 4
    ......
    initialize network with pretrained
    initialize network with kaiming
    model [FuseNetModel] was created
    ---------- Networks initialized -------------
    [Network FuseNet] Total number of parameters : 44.187 M

    Traceback (most recent call last):
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
        (self._dns_host, self.port), self.timeout, **extra_kw)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
        raise err
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
        chunked=chunked)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1229, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1275, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1224, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1016, in _send_output
        self.send(msg)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 956, in send
        self.connect()
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 181, in connect
        conn = self._new_conn()
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
        self, "Failed to establish a new connection: %s" % e)
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
        timeout=timeout
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/retry.py", line 398, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused'))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/visdom/__init__.py", line 446, in _send
        data=json.dumps(msg),
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/api.py", line 116, in post
        return request('post', url, data=data, json=json, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/api.py", line 60, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/sessions.py", line 524, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/sessions.py", line 637, in send
        r = adapter.send(request, **kwargs)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused'))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "train.py", line 49, in <module>
        visualizer = Visualizer(train_opt)
      File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/util/visualizer.py", line 72, in __init__
        self.vis = visdom.Visdom(server=opt.display_server, port=opt.display_port, env=opt.display_env, raise_exceptions=True)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/visdom/__init__.py", line 327, in __init__
        }, endpoint='env/' + env)
      File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/visdom/__init__.py", line 460, in _send
        raise ConnectionError("Error connecting to Visdom server")
    ConnectionError: Error connecting to Visdom server

My environment now: Python 3.7.0, torch 0.4.1.post2, CUDA 10.0

hazirbas commented 4 years ago

Next time, can you annotate the log and post only the error?

You need to start the visdom server.
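
The visdom server is normally started in a separate terminal with `python -m visdom.server` before running `train.py`. To confirm that something is actually listening before training starts, a small check like the one below may help. It is a sketch using only the standard library; `visdom_server_reachable` is a hypothetical helper, and the default host and port are taken from the `localhost:8097` address in the traceback above.

```python
import socket


def visdom_server_reachable(host="localhost", port=8097, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the given address
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `visdom_server_reachable()` before constructing the visualizer returns `False` in the situation shown in the traceback, where the connection to port 8097 is refused. It only tests that the port accepts connections, not that the process behind it is visdom.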

zeng32 commented 4 years ago

Hi, sorry about the previous comment. I have redone everything, and it still gets stuck at the same CUDA error as before:

    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
    ......
    RuntimeError: cuda runtime error (11) : invalid argument at pytorch/aten/src/THC/THCGeneral.cpp:663

The following is the environment created using conda:

    Name         Version
    python       3.7.0
    torch        0.4.1.post2
    torchfile    0.1.0
    torchvision  0.2.1

and CUDA 10.0.

hazirbas commented 4 years ago

This is related to the PyTorch and CUDA backend; if not, something is wrong with your input data.
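
To rule out the input data, one option is to inspect each batch just before it reaches the network. The sketch below is an assumption about what such a check could look like, not code from this repository. `check_batch` and its `expected_channels` parameter are hypothetical, and the function only uses `.shape`, `.min()` and `.max()`, so it should accept either a torch tensor or a NumPy array.

```python
import math


def check_batch(name, batch, expected_channels=None):
    """Return a list of problems found in one (N, C, H, W) input batch."""
    problems = []
    shape = tuple(batch.shape)

    if len(shape) != 4:
        problems.append(f"{name}: expected 4 dims (N, C, H, W), got shape {shape}")
    if any(dim == 0 for dim in shape):
        problems.append(f"{name}: empty dimension in shape {shape}")
        return problems  # min()/max() are not meaningful on empty data
    if expected_channels is not None and len(shape) == 4 and shape[1] != expected_channels:
        problems.append(f"{name}: expected {expected_channels} channels, got {shape[1]}")

    # Flag batches whose extreme values are NaN or infinite
    lo, hi = float(batch.min()), float(batch.max())
    if not (math.isfinite(lo) and math.isfinite(hi)):
        problems.append(f"{name}: min or max is NaN/Inf (min={lo}, max={hi})")
    return problems
```

It could be called on the RGB and depth tensors that the traceback shows being passed to the network, printing any returned messages before the forward pass. An empty list means none of these particular checks failed; it does not prove the data is valid.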