I have never had this problem on my side. I would suggest trying the MoCo v2 model in the original OpenSelfSup or in ConCL. If you still see the same problem, it is probably related to your environment. Otherwise, feel free to email me directly and we can try to debug it together.
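If it does turn out to be environment-related, it may help to post your exact PyTorch/CUDA/OS details so we can compare setups. This is just the stock PyTorch environment report, nothing specific to this repo:

```python
# Print PyTorch, CUDA, cuDNN, and OS details for debugging
# (same output as running `python -m torch.utils.collect_env`).
from torch.utils.collect_env import main as collect_env

collect_env()
```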
```
2023-06-28 11:50:00,219 - openselfsup - INFO - workflow: [('train', 1)], max: 200 epochs
2023-06-28 11:50:00,219 - openselfsup - INFO - Checkpoints will be saved to /home/user/disk1/tjj/code/PLM_SSL/output_dir by HardDiskBackend.
2023-06-28 11:50:08,893 - torch.nn.parallel.distributed - INFO - Reducer buckets have been rebuilt in this iteration.
2023-06-28 11:50:08,895 - torch.nn.parallel.distributed - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 147, in _serve
    close()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor
Traceback (most recent call last):
  File "/home/user/disk1/tjj/code/PLM_SSL/tools/train.py", line 170, in <module>
    main()
  File "/home/user/disk1/tjj/code/PLM_SSL/tools/train.py", line 160, in main
    train_model(
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/apis/train.py", line 97, in train_model
    _dist_train(
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/apis/train.py", line 228, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 49, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/user/disk1/tjj/code/PLM_SSL/openselfsup/datasets/loader/build_loader.py", line 110, in __iter__
    for next_input_dict in self.loader:
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8714 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8713) of binary: /home/user/disk1/anaconda3/envs/seg/bin/python
Traceback (most recent call last):
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/disk1/anaconda3/envs/seg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
```
Did you ever encounter this EOFError problem? How can it be solved?
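For anyone hitting the same thing: an `EOFError` raised in `recvfds`, together with the earlier `OSError: [Errno 9] Bad file descriptor`, usually means the processes exchanging DataLoader batches ran out of file descriptors. A commonly suggested, generic PyTorch workaround (not specific to PLM_SSL; the placement in the entry point is just a suggestion) is to switch the tensor-sharing strategy:

```python
# Generic workaround sketch for fd exhaustion in DataLoader workers.
# Place near the top of the training entry point (e.g. tools/train.py).
import torch.multiprocessing as mp

# Share tensors via the file system instead of file descriptors, so each
# queued batch no longer holds an open fd in the receiving process.
mp.set_sharing_strategy('file_system')
```

If that alone does not help, raising the open-file limit (`ulimit -n`) before launching, or lowering the DataLoader's `num_workers`, are other common mitigations for this class of error.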