Closed zeng32 closed 4 years ago
It could be a CUDA/PyTorch version problem. Can you try with CUDA 10?
create web directory ./checkpoints/sunrgbd/web...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
File "train.py", line 64, in <module>
Now my environment: Python: 3.7.6 torch: 0.4.1.post2 cuda : 10.0
I couldn't replicate the error with Python 3.7.0, CUDA 10, and torch 0.4.1.post2. People have reported similar problems for other code bases. As @hazirbas said, it is probably related to PyTorch/CUDA; maybe you can try reinstalling PyTorch.
Traceback (most recent call last):
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 1016, in _send_output
self.send(msg)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/http/client.py", line 956, in send
self.connect()
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 181, in connect
conn = self._new_conn()
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/visdom/__init__.py", line 446, in _send
data=json.dumps(msg),
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/api.py", line 116, in post
return request('post', url, data=data, json=json, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/sessions.py", line 524, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/sessions.py", line 637, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/py370/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ea099dcf8>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 49, in <module>
Now my environment: Python: 3.7.0 torch: 0.4.1.post2 cuda : 10.0
Next time, can you annotate the output and post only the error?
You need to start the visdom server.
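For anyone hitting the same `Connection refused` on `localhost:8097`: the traceback above only says that nothing was accepting connections on that port. The visdom server is normally started in a separate terminal with `python -m visdom.server` before running `train.py`. As a rough sketch (the helper name `visdom_port_open` is made up for illustration and is not part of visdom or this repository), the port can be probed with the standard library before training starts:

```python
import socket


def visdom_port_open(host="localhost", port=8097, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: it only checks that the port is open,
    not that the listener is actually a visdom server.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if visdom_port_open():
        print("Something is listening on localhost:8097.")
    else:
        print("Nothing is listening on localhost:8097.")
        print("Start the server first: python -m visdom.server")
```

The host and port here are the ones shown in the traceback; adjust them if the server runs elsewhere.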
Hi, sorry for the earlier comment. I have redone everything. It still gets stuck at the same CUDA problem as before:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
......
RuntimeError: cuda runtime error (11) : invalid argument at pytorch/aten/src/THC/THCGeneral.cpp:663
The following is the environment created using conda:
Name Version
python 3.7.0
torch 0.4.1.post2
torchfile 0.1.0
torchvision 0.2.1
and CUDA: 10.0
This is related to the PyTorch and CUDA backend; if not, something is wrong with your input data.
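To help separate a backend mismatch from a data problem, it can be useful to print what the Python process itself sees, since the CUDA version PyTorch was built against can differ from the toolkit installed on the system. A minimal sketch follows; `collect_env` is a name chosen here for illustration, and the function degrades gracefully if `torch` cannot be imported:

```python
import sys


def collect_env():
    """Gather the interpreter and PyTorch/CUDA versions visible to this process."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,             # version of the torch package that gets imported
        "torch_cuda_build": None,  # CUDA version torch was compiled against
        "cuda_available": None,    # whether torch can use a GPU right now
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; report only Python.
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Run it inside the same conda environment used for `train.py`, and post the output together with the error.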
Hi, when I first ran the code:
python train.py --dataroot datasets/sunrgbd --dataset sunrgbd --name sunrgbd
I got the following error and got stuck. Please enlighten me.
----------------- Options ---------------
batch_size: 4
......
initialize network with pretrained
initialize network with kaiming
model [FuseNetModel] was created
---------- Networks initialized -------------
[Network FuseNet] Total number of parameters : 44.187 M
create web directory ./checkpoints/sunrgbd/web...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7f7348f6d3b0>
Traceback (most recent call last):
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/queues.py", line 354, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 499, in Client
deliver_challenge(c, authkey)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 729, in deliver_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "train.py", line 64, in <module>
model.optimize_parameters()
File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 63, in optimize_parameters
self.forward()
File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/fusenet_model.py", line 53, in forward
self.output = self.netFuseNet(self.rgb_image,self.depth_image)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu4067/Documents/fusion/fusenet-pytorch/models/networks.py", line 216, in forward
x_1 = self.CBR1_DEPTH_ENC(depth_inputs)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu4067/anaconda3/envs/fuse37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663
My environment: Python: 3.7 torch: 0.4.1.post2 pytorch: 1.4.0 cuda : 10.1