KidsWithTokens / MedSegDiff

Medical Image Segmentation with Diffusion Model
MIT License

error in training #142

Open apuomline opened 7 months ago

apuomline commented 7 months ago

Hello, author. I ran into an error during training. What is the specific cause of this error?

python scripts/segmentation_train.py --data_name ISIC --data_dir F:\liuxiao\project\dataset\isbi_3b_medsegdiff --out_dir F:\liuxiao\project\MedSegDiff\outdir --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8

error: The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\urllib3\util\retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8850): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000024EDE7FCC40>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\visdom\__init__.py", line 756, in _send
    return self._handle_post(
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\visdom\__init__.py", line 720, in _handle_post
    r = self.session.post(url, data=data)
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\requests\sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8850): Max retries exceeded with url: /env/main (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000024EDE7FCC40>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.'))
[WinError 10061] No connection could be made because the target machine actively refused it.
on_close() takes 1 positional argument but 3 were given
Visdom python client failed to establish socket to get messages from the server. This feature is optional and can be disabled by initializing Visdom with use_incoming_socket=False, which will prevent waiting for this request to timeout.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [::ffff:127.0.1.1]:59878 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "scripts/segmentation_train.py", line 118, in <module>
    main()
  File "scripts/segmentation_train.py", line 26, in main
    dist_util.setup_dist(args)
  File "F:\liuxiao\project\MedSegDiff.\guided_diffusion\dist_util.py", line 46, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "F:\miniconda\envs\medsegdiff\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
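
Reading the log, two separate things appear to go wrong. The Visdom client cannot reach a server at localhost:8850, which by itself only produces warnings. The run then stops because `dist.init_process_group` is called with a `backend` value that requests NCCL, and PyTorch builds for Windows do not include NCCL. The sketch below is a minimal pre-flight check under those assumptions. It is not code from MedSegDiff: `choose_backend` and `server_reachable` are hypothetical helpers, and the result of `choose_backend` would still have to be wired into `setup_dist` by hand.

```python
import socket
import sys


def choose_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Return 'nccl' only where it can be used, otherwise fall back to 'gloo'.

    Hypothetical helper, not part of MedSegDiff. With PyTorch installed,
    nccl_available could be taken from torch.distributed.is_nccl_available().
    """
    # sys.platform is "win32" on Windows, where PyTorch ships without NCCL.
    if platform.startswith("win"):
        return "gloo"
    return "nccl" if nccl_available else "gloo"


def server_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (WinError 10061) and timeouts.
        return False


if __name__ == "__main__":
    # Port 8850 is the one that appears in the log above.
    print("backend:", choose_backend())
    print("visdom reachable:", server_reachable("localhost", 8850))
```

If `server_reachable("localhost", 8850)` returns False, no Visdom server is listening on that port, which matches the refused connection in the log. If `choose_backend()` returns "gloo", passing that value as `backend` to `dist.init_process_group` should avoid the NCCL error, although this has not been tested against this repository.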

Lxycherryup commented 1 month ago

Have you solved this problem? I ran into it as well.

Mia01023 commented 1 month ago

same problem

fayeGou commented 1 week ago

How can this problem be solved?