Training the FCN network on a GPU in the cloud fails with this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1

nanwang-crea commented 6 months ago

This is the full error output. I searched online, and many posts attribute it to inter-process communication. How can this be solved? Which parts of the code should be changed?

Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
  File "/public/home/2023020919/FCN/train.py", line 206, in <module>
    main(args)
  File "/public/home/2023020919/FCN/train.py", line 141, in main
    mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch,
  File "/public/home/2023020919/FCN/train_utils/train_and_evals.py", line 42, in train_one_epoch
    for image, target in metric_logger.log_every(data_loader, print_freq, header):
  File "/public/home/2023020919/FCN/train_utils/distrributed_utils.py", line 189, in log_every
    for obj in iterable:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
    fd = df.detach()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
srun: error: gpu03: task 0: Exited with exit code 1

NewBeeMrz commented 5 months ago

Have you solved this problem? I am running into the same error.

nanwang-crea commented 5 months ago

Yes, it is solved. The cause was an incorrectly chosen directory location, which needed to be changed.

NewBeeMrz commented 5 months ago

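Do you mean the location of the dataset? My dataset sits in the root of my personal account directory, at the same level as the program directory, but I still get this error when I run it.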

nanwang-crea commented 5 months ago

It also includes the path of the file you are running. I suggest switching to absolute paths and trying again.
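A minimal sketch of the absolute-path suggestion, assuming the dataset location is passed to the training script as a string. `resolve_existing_dir` is a hypothetical helper written for illustration and is not part of the repository. It turns a relative path into an absolute one and fails immediately if the directory is missing, so a wrong location is reported before any DataLoader worker starts.

```python
from pathlib import Path


def resolve_existing_dir(path_str, base_dir=None):
    """Return an absolute Path for path_str, raising early if it is not a directory.

    Relative paths are joined onto base_dir (or the current working
    directory when base_dir is None) before being resolved.
    """
    path = Path(path_str).expanduser()
    if not path.is_absolute():
        base = Path(base_dir) if base_dir is not None else Path.cwd()
        path = base / path
    path = path.resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Directory not found: {path}")
    return path


# Example: resolve the current directory; replace "." with your dataset folder.
print(resolve_existing_dir("."))
```

Under a scheduler such as srun, the working directory of the job may differ from the one used interactively, so passing an explicit `base_dir` (or an absolute path to begin with) avoids depending on it.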

NewBeeMrz commented 5 months ago

I have solved it. The cause was indeed the location of the dataset files. Thank you.

WZMIAOMIAO / deep-learning-for-image-processing

Training the FCN network on a GPU in the cloud fails with this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1 #801