wangjuansan closed this issue 1 year ago.
Yes, I have met a similar problem, but it should already be addressed in our code. Have you modified any code?
Thank you for your reply! I didn't change the code. BTW, may I ask: when you ran into this problem, what parts of the code did you change to solve it?
Hello, it seems that this is caused by taking an entry of a shared list in the dataloader without an explicit deep copy. For example, in OpenPCDet: https://github.com/open-mmlab/OpenPCDet/blob/bce886d6e36e3deaec2ce6f8b54b133fcadf62d6/pcdet/datasets/kitti/kitti_dataset.py#L376
They need to do an explicit deepcopy instead of directly taking the info from the info list.
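To make the idea concrete, here is a minimal sketch of that pattern (the class and attribute names are only placeholders, not the actual OpenPCDet or PLA code):

import copy
from torch.utils.data import Dataset

class ExampleDataset(Dataset):
    def __init__(self, infos):
        # list of per-sample dicts that is shared across dataloader workers
        self.infos = infos

    def __len__(self):
        return len(self.infos)

    def __getitem__(self, index):
        # info = self.infos[index] would hand out a reference into the shared
        # list; any later in-place modification then leaks back into it.
        # Deep-copying gives each sample its own independent dict.
        info = copy.deepcopy(self.infos[index])
        return info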
As for our PLA, since I cannot reproduce this problem now, I think you can first try to run this cfg: https://github.com/CVMI-Lab/PLA/blob/main/tools/cfgs/scannet_models/spconv_clip_adamw.yaml and check whether the problem still exists. If it disappears, the problem should be caused by loading the caption files.
Yes, I ran python train.py --cfg_file cfgs/scannet_models/spconv_clip_adamw.yaml today and this problem didn't occur.
Should I change this line? https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L201C45-L201C45
This line already uses deep copy, so it is not supposed to cause this problem. But I think it is worth a try, since the problem is caused by loading the captions and the caption idx.
Since we cannot reproduce this problem for now, I can only provide some possible solutions. Here is one possible approach: in my experience, this error takes place at a specific iteration, because the limit on open files is a fixed number. You can find that number by checking at which iteration training stops, then set a breakpoint at that iteration and check which line of loading code causes the problem.
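For example, a minimal sketch of such a conditional breakpoint (cur_it, total_it_each_epoch and dataloader_iter are the existing variables of the training loop in train_utils.py; the iteration number 27 is just whatever you observe on your machine):

import pdb

SUSPECT_ITER = 27  # the iteration at which the crash is observed

for cur_it in range(total_it_each_epoch):
    if cur_it == SUSPECT_ITER - 1:
        # drop into the debugger one iteration early, then step ('s') into
        # next(dataloader_iter) down to the dataset's __getitem__ and the
        # caption-loading code
        pdb.set_trace()
    batch = next(dataloader_iter)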
You can also check this issue: https://github.com/pytorch/pytorch/issues/11201. If you use a personal computer, this should work in my experience, but it did not work on the cluster that I used.
Thank you for your help. I added the following code and it worked.
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
But training still runs out of memory at some epoch with this method:
2023-08-31 20:27:52,687 INFO Epoch [5/128][240/400] LR: 0.004, ETA: 3 days, 4:36:35, Data: 0.15 (0.14), Iter: 7.69 (5.59), Accuracy: 0.76, loss_seg=0.71, binary_loss=0.31, caption_view=0.20, caption_entity=0.19, loss=1.41, n_captions=17.3(12.8)
2023-08-31 20:29:48,883 INFO Epoch [5/128][260/400] LR: 0.004, ETA: 3 days, 4:48:50, Data: 0.10 (0.14), Iter: 6.13 (5.60), Accuracy: 0.78, loss_seg=0.65, binary_loss=0.20, caption_view=0.19, caption_entity=0.18, loss=1.21, n_captions=13.8(12.9)
2023-08-31 20:31:38,831 INFO Epoch [5/128][280/400] LR: 0.004, ETA: 3 days, 4:40:42, Data: 0.10 (0.13), Iter: 5.31 (5.60), Accuracy: 0.83, loss_seg=0.53, binary_loss=0.28, caption_view=0.18, caption_entity=0.17, loss=1.16, n_captions=13.4(12.9)
Traceback (most recent call last):
File "train.py", line 260, in <module>
if __name__ == '__main__':
File "train.py", line 255, in main
text_encoder=text_encoder,
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.16 GiB (GPU 0; 23.70 GiB total capacity; 17.34 GiB already allocated; 1.88 GiB free; 18.27 GiB reserved in total by PyTorch)
The breakpoint stops at https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/tools/train_utils/train_utils.py#L47, with cur_it = 27 and total_it_each_epoch = 600:
scaler = torch.cuda.amp.GradScaler(enabled=args.use_amp)

for cur_it in range(total_it_each_epoch):
    try:
        batch = next(dataloader_iter)
    except StopIteration:
        dataloader_iter = iter(train_loader)
        batch = next(dataloader_iter)
        print('new iters')

    batch['epoch'] = cur_epoch
    data_timer = time.time()
    cur_data_time = data_timer - end
For the out-of-memory situation, may I know your batch size and GPU memory?
For the breakpoint, you should step into your dataloader's __getitem__ function, especially the loading of the caption files.
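If it helps, one rough way to confirm that file descriptors are leaking (a sketch, not part of the PLA code, and Linux-only since it reads /proc/self/fd) is to log the open-fd count every few iterations and see whether it grows steadily:

import os

def count_open_fds():
    # each entry under /proc/self/fd is one file descriptor currently open
    # in this process (Linux only)
    return len(os.listdir('/proc/self/fd'))

# call this inside the training loop, e.g. every 20 iterations:
#   if cur_it % 20 == 0:
#       print(f'iter {cur_it}: open fds = {count_open_fds()}')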
In fact, I tried several different combinations of batch size and GPU memory: batch size = 16, 14, 12 on a 48G GPU, and batch size = 12, 8 on a 24G GPU.
When I set ulimit -n 10240 and added the code
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
I found that the code hits this problem at the 2nd or 5th epoch of training. I don't think this has anything to do with GPU memory, because with this memory and batch size setting the code can train a full epoch, so it shouldn't be caused by a lack of memory.
When I set ulimit -n 1024 (the default), with batch size = 8 and a 24G GPU, and removed
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
the code quickly reported an error in the first epoch.
2023-09-01 10:45:37,915 INFO **********************Start training scannet_models/spconv_clip_base15_caption_adamw(default)**********************
2023-09-01 10:56:17,736 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 27 days, 23:45:32, Data: 2.11 (26.34), Iter: 5.29 (31.50), Accuracy: 0.50, loss_seg=1.77, binary_loss=0.47, caption_view=0.22, caption_entity=0.21, loss=2.66, n_captions=17.4(11.9)
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 179, in send_handle
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/socket.py", line 463, in fromfd
nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Exception in thread Thread-8:
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
return recvfds(s, 1)[0]
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
raise EOFError
EOFError
For the breakpoint, could you please give me the locations of the code that I need to step into? I'm not familiar with your code, and it took me a lot of time to locate it. Thank you very much.
Hello, your previous link is correct; you should mainly focus on this function: https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L196
Sorry, it is a little bit hard for me to figure out the error. I have tried many times, and every time I debug, it gets stuck.
At first, I got OSError: [Errno 24] Too many open files and RuntimeError: unable to open shared memory object </torch_235148_4273977650> in read-write mode.
2023-09-04 18:47:30,763 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 6 days, 23:36:13, Data: 0.08 (0.47), Iter: 3.16 (7.86), Accuracy: 0.57, loss_seg=1.48, binary_loss=0.49, caption_view=0.20, caption_entity=0.16, loss=2.33, n_captions=11.6(12.5)
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_235148_4273977650> in read-write mode
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
But if I add
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
I get RuntimeError: CUDA out of memory. Obviously, it is not really because of the CUDA memory: when I reduce the batch size to 4 or 2, I still get this error, just at the 4th or 7th epoch instead.
2023-09-04 17:59:04,135 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 9 days, 1:48:45, Data: 0.13 (0.47), Iter: 6.44 (10.21), Accuracy: 0.56, loss_seg=1.51, binary_loss=0.51, caption_view=0.19, caption_entity=0.18, loss=2.39, n_captions=11.0(12.5)
2023-09-04 18:01:05,994 INFO Epoch [1/128][40/600] LR: 0.004, ETA: 7 days, 5:48:45, Data: 0.10 (0.29), Iter: 6.13 (8.15), Accuracy: 0.61, loss_seg=1.29, binary_loss=0.47, caption_view=0.20, caption_entity=0.18, loss=2.14, n_captions=15.2(12.7)
2023-09-04 18:03:11,054 INFO Epoch [1/128][60/600] LR: 0.004, ETA: 6 days, 16:17:21, Data: 0.14 (0.22), Iter: 5.43 (7.52), Accuracy: 0.63, loss_seg=1.29, binary_loss=0.32, caption_view=0.18, caption_entity=0.15, loss=1.94, n_captions=7.4(12.6)
2023-09-04 18:05:14,054 INFO Epoch [1/128][80/600] LR: 0.004, ETA: 6 days, 8:57:07, Data: 0.12 (0.19), Iter: 6.72 (7.18), Accuracy: 0.63, loss_seg=1.18, binary_loss=0.30, caption_view=0.20, caption_entity=0.17, loss=1.85, n_captions=13.1(12.5)
2023-09-04 18:07:22,827 INFO Epoch [1/128][100/600] LR: 0.004, ETA: 6 days, 5:46:11, Data: 0.10 (0.17), Iter: 6.16 (7.03), Accuracy: 0.54, loss_seg=1.45, binary_loss=0.46, caption_view=0.19, caption_entity=0.16, loss=2.26, n_captions=10.4(12.7)
2023-09-04 18:09:18,085 INFO Epoch [1/128][120/600] LR: 0.004, ETA: 6 days, 1:13:48, Data: 0.09 (0.16), Iter: 6.99 (6.82), Accuracy: 0.71, loss_seg=1.05, binary_loss=0.40, caption_view=0.20, caption_entity=0.18, loss=1.82, n_captions=13.4(12.6)
2023-09-04 18:11:29,505 INFO Epoch [1/128][140/600] LR: 0.004, ETA: 6 days, 0:26:09, Data: 0.08 (0.15), Iter: 5.18 (6.78), Accuracy: 0.67, loss_seg=1.07, binary_loss=0.32, caption_view=0.20, caption_entity=0.18, loss=1.77, n_captions=12.4(12.7)
2023-09-04 18:13:37,711 INFO Epoch [1/128][160/600] LR: 0.004, ETA: 5 days, 23:24:20, Data: 0.14 (0.14), Iter: 7.65 (6.74), Accuracy: 0.69, loss_seg=1.04, binary_loss=0.28, caption_view=0.19, caption_entity=0.14, loss=1.66, n_captions=11.9(12.9)
Traceback (most recent call last):
File "train.py", line 260, in <module>
main()
File "train.py", line 255, in main
arnold=arnold
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 23.70 GiB total capacity; 11.56 GiB already allocated; 1.23 GiB free; 12.34 GiB reserved in total by PyTorch)
Here are my guesses.
I'm wondering if it's caused by repeatedly loading the caption files, for example constantly opening files without closing them, or constantly loading them into memory without freeing them, so that the problem does not occur when the sample count is small but only when it is large.
When the batch size is small, the training can even run through a complete epoch. This suggests that the memory occupied during the first epoch is not completely freed, or the files are not closed, so the second epoch keeps adding to memory or keeps opening files, which ultimately leads to this problem.
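If that guess is right, one generic thing to double-check (a sketch only; I don't know the exact format or helper used for the caption files) is that every caption file is read through a context manager so its descriptor is released immediately:

import json

def load_caption_file(path):
    # 'with' closes the file as soon as the contents are read, even if
    # json.load raises; a bare open() that is never closed would slowly
    # consume the per-process fd limit across the dataloader workers
    with open(path, 'r') as f:
        return json.load(f)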
BTW, when you last encountered this problem, how did you locate it and what exactly caused it? I know it was caused by loading the captions, but could you please give more specific information? (If you can't remember, maybe look it up in the git commit history.)
Looking forward to your reply.
Hi, we encountered this issue while indexing the corresponding caption and resolved it by adding 'copy.deepcopy'. https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L201. Currently, we are unable to reproduce this error, but we will investigate the cause at a later time.
Hi, thank you for your reply. If deepcopy resolved this problem, then I think it didn't take effect in my training: I set a breakpoint at line 201, cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index]), but it never reached it. The code goes into the if isinstance(self.scene_image_corr_infos, dict) branch, not the else branch, and then the error occurs.
if hasattr(self, 'scene_image_corr_infos') and self.scene_image_corr_infos is not None:
    if isinstance(self.scene_image_corr_infos, dict):
        # assert scene_name in self.scene_image_corr_infos
        info = self.scene_image_corr_infos.get(scene_name, {})
    else:
        cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index])
        assert scene_name == cur_caption_idx['scene_name']
        info = cur_caption_idx['infos']
    if len(info) > 0:
        image_name_view, image_corr_view = zip(*info.items())
    else:
        image_name_view, image_corr_view = [], []
    image_name_dict['view'] = image_name_view
    image_corr_dict['view'] = image_corr_view
Hi, in that case, could you please try wrapping self.scene_image_corr_infos.get(scene_name, {}) with copy.deepcopy?
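Something along these lines, i.e. a sketch of the suggested change to that branch in indoor_dataset.py (only the dict branch changes; the else branch stays as it is):

if isinstance(self.scene_image_corr_infos, dict):
    # deep-copy the per-scene info so the returned sample does not keep a
    # reference into the dict shared by all dataloader workers
    info = copy.deepcopy(self.scene_image_corr_infos.get(scene_name, {}))
else:
    cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index])
    assert scene_name == cur_caption_idx['scene_name']
    info = cur_caption_idx['infos']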
Hi, thank you for your help. I followed your advice and wrapped self.scene_image_corr_infos.get(scene_name, {}) and self.scene_image_corr_entity_infos.get(scene_name, {}) with copy.deepcopy. Now it's not reporting errors; I'll wait for the training to complete to see if the issue recurs.
Hi, with batch size = 12 and a 24G GPU, running python train.py --cfg_file cfgs/scannet_models/spconv_clip_base15_caption_adamw.yaml
the training stopped at the 32nd epoch with a "CUDA out of memory" error. Are there other places in the code that need to be changed as well?
2023-09-06 08:08:17,913 INFO Epoch [32/128][160/400] LR: 0.004, ETA: 2 days, 10:44:09, Data: 0.07 (0.10), Iter: 3.72 (5.47), Accuracy: 0.89, loss_seg=0.30, binary_loss=0.22, caption_view=0.15, caption_entity=0.13, loss=0.79, n_captions=7.8(13.0)
Traceback (most recent call last):
File "train.py", line 261, in <module>
main()
File "train.py", line 256, in main
arnold=arnold
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.39 GiB (GPU 0; 23.70 GiB total capacity; 19.30 GiB already allocated; 1.97 GiB free; 20.02 GiB reserved in total by PyTorch)
The batch size is set too large. Maybe you can try a batch size of 4 per GPU.
I'm sorry, I'm having trouble understanding a couple of things. You mention that you run all experiments with batch size 32 on 8 NVIDIA V100 or NVIDIA A100. May I ask how long you trained for, and what the GPU utilization was during training?
We use batch size 32 on 8 GPUs to align with our baseline SparseUNet. If you want to change the default setting, you should adjust the LR schedule and the number of epochs yourself to align the results. We haven't checked GPU utilization carefully, but it is around 80-90%. GPU utilization is influenced by the CPU, IO and many other factors, so it varies from device to device.
The GPU allocation varies during training because spconv allocates varying amounts of GPU memory, and the number of captions in each batch can also differ, leading to different GPU memory allocations. As a result, there may be occasional CUDA OOM errors if the batch size is set too large. Currently, we run with a batch size of 4 per GPU on 32GB V100 or 80GB A100 (primarily for instance segmentation in S3DIS). We haven't tested on other GPU machines to verify the GPU limit. The GPU utilization is high (around ~80-90%), and the average memory allocation is around 10GB+. However, I would not recommend setting a large batch size since the GPU memory changes over time.
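Regarding adjusting the LR schedule for a smaller total batch size mentioned above, a common heuristic (not a setting taken from the PLA configs) is the linear scaling rule, sketched here with the default LR of 0.004 seen in the logs:

base_batch_size = 32   # total batch size the default schedule was tuned for
base_lr = 0.004        # default learning rate from the config/logs
my_batch_size = 12     # e.g. a single 24G GPU

# linear scaling rule: scale the LR proportionally to the total batch size
scaled_lr = base_lr * my_batch_size / base_batch_size
print(scaled_lr)       # 0.0015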
Hi~ When I run
python train.py --cfg_file cfgs/scannet_models/spconv_clip_base15_caption_adamw.yaml
I hit an error. I then reduced the num_workers and the batch_size. I guess it is caused by the open-file limit, so I set ulimit -n 10240, but I still got the error. Have you ever had a similar problem? Looking forward to your reply!