cvlab-stonybrook / PLM_SSL

Repository for "Precise Location Matching Improves Dense Contrastive Learning in Digital Pathology"

'EOF error' problem #3

Closed · shuaijun-36 closed this issue 1 year ago

shuaijun-36 commented 1 year ago

2023-06-28 11:50:00,219 - openselfsup - INFO - workflow: [('train', 1)], max: 200 epochs
2023-06-28 11:50:00,219 - openselfsup - INFO - Checkpoints will be saved to /home/user/disk1/tjj/code/PLM_SSL/output_dir by HardDiskBackend.
2023-06-28 11:50:08,893 - torch.nn.parallel.distributed - INFO - Reducer buckets have been rebuilt in this iteration.
2023-06-28 11:50:08,895 - torch.nn.parallel.distributed - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 147, in _serve
    close()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "/home/user/disk1/tjj/code/PLM_SSL/tools/train.py", line 170, in <module>
    main()
  File "/home/user/disk1/tjj/code/PLM_SSL/tools/train.py", line 160, in main
    train_model(
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/apis/train.py", line 97, in train_model
    _dist_train(
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/apis/train.py", line 228, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 49, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/datasets/loader/build_loader.py", line 110, in __iter__
    for next_input_dict in self.loader:
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8714 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8713) of binary: /home/user/disk1/anaconda3/envs/seg/bin/python
Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Did you ever encounter this 'EOF error' problem? How can I solve it?
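Editor's note for readers hitting the same trace: the failure occurs while a DataLoader worker hands a shared-memory file descriptor back to the main process (rebuild_storage_fd → recv_handle → recvfds), which usually indicates the worker died or file-descriptor passing failed, rather than a bug in PLM_SSL itself. One quick way to confirm this is to disable worker subprocesses entirely. The fragment below is a minimal sketch in the OpenSelfSup/mmcv config style; the values are placeholders, not taken from the repository's configs.

```python
# Hypothetical diagnostic override (placeholder values): load every batch in
# the main process so no file descriptors are passed between processes. If the
# EOFError disappears, the problem is in worker fd sharing, not in the model
# or the dataset code.
data = dict(
    imgs_per_gpu=32,    # keep whatever batch size your config already uses
    workers_per_gpu=0,  # 0 = no DataLoader worker subprocesses
)
```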

jingweizhang-xyz commented 1 year ago

I never had such problems on my side. I would suggest trying the MoCo v2 model in the original OpenSelfSup or in ConCL. If you still hit the same problem there, it probably has something to do with your environment. Otherwise, you can email me directly and we can try to debug it together.
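Editor's note (not confirmed by the authors): this EOFError, together with the earlier "Bad file descriptor", typically points at PyTorch's default file_descriptor sharing strategy failing when DataLoader workers send tensors to the main process, often because the per-process open-file limit is too low. A minimal, hedged sketch of the usual PyTorch-level workaround, placed near the top of tools/train.py before the data loaders are built, might look like this:

```python
# Generic PyTorch workaround, not specific to PLM_SSL: share tensors between
# DataLoader workers through the filesystem instead of passing file
# descriptors over Unix sockets.
import torch.multiprocessing as mp

mp.set_sharing_strategy('file_system')
```

Raising the open-file limit in the launching shell (for example `ulimit -n 65536`) before running `torch.distributed.launch` is another common mitigation for this class of error.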