wangjuansan closed this issue 1 year ago.
Yes, I have met a similar problem, but it should already be addressed in our code. Have you modified any code?
Thank you for your reply! I didn't change the code. BTW, may I ask: when you ran into this problem, what parts of the code did you change to solve it?
Hello, it seems that this is caused by taking an entry of a shared list in the dataloader without an explicit deep copy. For example, in OpenPCDet: https://github.com/open-mmlab/OpenPCDet/blob/bce886d6e36e3deaec2ce6f8b54b133fcadf62d6/pcdet/datasets/kitti/kitti_dataset.py#L376
They need to do an explicit deepcopy instead of directly taking the info from the info list.
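To make the idea concrete, here is a minimal sketch of that pattern (the class and attribute names are only placeholders, not the actual OpenPCDet or PLA code):

import copy
from torch.utils.data import Dataset

class ExampleDataset(Dataset):
    def __init__(self, infos):
        # list of per-sample dicts that is shared across dataloader workers
        self.infos = infos

    def __len__(self):
        return len(self.infos)

    def __getitem__(self, index):
        # info = self.infos[index] would hand out a reference into the shared
        # list; any later in-place modification then leaks back into it.
        # Deep-copying gives each sample its own independent dict.
        info = copy.deepcopy(self.infos[index])
        return info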
As for our PLA, since I cannot reproduce this problem now, I think you can first try to run this cfg: https://github.com/CVMI-Lab/PLA/blob/main/tools/cfgs/scannet_models/spconv_clip_adamw.yaml and check whether the problem still exists. If it disappears, the problem should be caused by loading the caption files.
Yes, I ran python train.py --cfg_file cfgs/scannet_models/spconv_clip_adamw.yaml today and this problem didn't occur.
Should I change this line? https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L201C45-L201C45
This line already uses deep copy, so it is not supposed to cause this problem. But I think it is worth a try, since the problem is caused by loading the captions and the caption idx.
Since we cannot reproduce this problem for now, I can only provide some possible solutions. Here is one possible approach: in my experience, this error takes place at a specific iteration, because the limit on open files is a fixed number. You can find that number by checking at which iteration training stops, then set a breakpoint at that iteration and check which line of loading code causes the problem.
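For example, a minimal sketch of such a conditional breakpoint (cur_it, total_it_each_epoch and dataloader_iter are the existing variables of the training loop in train_utils.py; the iteration number 27 is just whatever you observe on your machine):

import pdb

SUSPECT_ITER = 27  # the iteration at which the crash is observed

for cur_it in range(total_it_each_epoch):
    if cur_it == SUSPECT_ITER - 1:
        # drop into the debugger one iteration early, then step ('s') into
        # next(dataloader_iter) down to the dataset's __getitem__ and the
        # caption-loading code
        pdb.set_trace()
    batch = next(dataloader_iter)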
You can also check this issue: https://github.com/pytorch/pytorch/issues/11201. If you use a personal computer, this should work in my experience, but it did not work on the cluster that I used.
Thank you for your help. I added the following code and it worked.
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
But training still runs out of memory at some epoch with this method:
2023-08-31 20:27:52,687 INFO Epoch [5/128][240/400] LR: 0.004, ETA: 3 days, 4:36:35, Data: 0.15 (0.14), Iter: 7.69 (5.59), Accuracy: 0.76, loss_seg=0.71, binary_loss=0.31, caption_view=0.20, caption_entity=0.19, loss=1.41, n_captions=17.3(12.8)
2023-08-31 20:29:48,883 INFO Epoch [5/128][260/400] LR: 0.004, ETA: 3 days, 4:48:50, Data: 0.10 (0.14), Iter: 6.13 (5.60), Accuracy: 0.78, loss_seg=0.65, binary_loss=0.20, caption_view=0.19, caption_entity=0.18, loss=1.21, n_captions=13.8(12.9)
2023-08-31 20:31:38,831 INFO Epoch [5/128][280/400] LR: 0.004, ETA: 3 days, 4:40:42, Data: 0.10 (0.13), Iter: 5.31 (5.60), Accuracy: 0.83, loss_seg=0.53, binary_loss=0.28, caption_view=0.18, caption_entity=0.17, loss=1.16, n_captions=13.4(12.9)
Traceback (most recent call last):
File "train.py", line 260, in <module>
if __name__ == '__main__':
File "train.py", line 255, in main
text_encoder=text_encoder,
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.16 GiB (GPU 0; 23.70 GiB total capacity; 17.34 GiB already allocated; 1.88 GiB free; 18.27 GiB reserved in total by PyTorch)
The breakpoint stops at https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/tools/train_utils/train_utils.py#L47, with cur_it = 27 and total_it_each_epoch = 600:
scaler = torch.cuda.amp.GradScaler(enabled=args.use_amp)

for cur_it in range(total_it_each_epoch):
    try:
        batch = next(dataloader_iter)
    except StopIteration:
        dataloader_iter = iter(train_loader)
        batch = next(dataloader_iter)
        print('new iters')

    batch['epoch'] = cur_epoch
    data_timer = time.time()
    cur_data_time = data_timer - end
For the out-of-memory situation, may I know your batch size and GPU memory?
For the breakpoint, you should step into your dataloader's __getitem__ function, especially the loading of the caption files.
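If it helps, one rough way to confirm that file descriptors are leaking (a sketch, not part of the PLA code, and Linux-only since it reads /proc/self/fd) is to log the open-fd count every few iterations and see whether it grows steadily:

import os

def count_open_fds():
    # each entry under /proc/self/fd is one file descriptor currently open
    # in this process (Linux only)
    return len(os.listdir('/proc/self/fd'))

# call this inside the training loop, e.g. every 20 iterations:
#   if cur_it % 20 == 0:
#       print(f'iter {cur_it}: open fds = {count_open_fds()}')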
In fact, I tried several different combinations of batch size and GPU memory: batch size = 16, 14, 12 on a 48G GPU, and batch size = 12, 8 on a 24G GPU.
When I set ulimit -n 10240 and added the code
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
I found that the code hits this problem at the 2nd or 5th epoch of training. I don't think this has anything to do with GPU memory, because with this memory and batch size setting the code can train a full epoch, so it shouldn't be caused by a lack of memory.
When I set ulimit -n 1024 (the default), with batch size = 8 and a 24G GPU, and removed
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
the code quickly reported an error in the first epoch.
2023-09-01 10:45:37,915 INFO **********************Start training scannet_models/spconv_clip_base15_caption_adamw(default)**********************
2023-09-01 10:56:17,736 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 27 days, 23:45:32, Data: 2.11 (26.34), Iter: 5.29 (31.50), Accuracy: 0.50, loss_seg=1.77, binary_loss=0.47, caption_view=0.22, caption_entity=0.21, loss=2.66, n_captions=17.4(11.9)
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 179, in send_handle
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/socket.py", line 463, in fromfd
nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Exception in thread Thread-8:
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
return recvfds(s, 1)[0]
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
raise EOFError
EOFError
For the breakpoint, could you please give me the locations of the code that I need to step into? I'm not familiar with your code, and it took me a lot of time to locate it. Thank you very much.
Hello, your previous link is correct; you should mainly focus on this function: https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L196
Sorry, it is a little bit hard for me to figure out the error. I have tried many times, and every time I debug, it gets stuck.
At first, I got OSError: [Errno 24] Too many open files and RuntimeError: unable to open shared memory object </torch_235148_4273977650> in read-write mode.
2023-09-04 18:47:30,763 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 6 days, 23:36:13, Data: 0.08 (0.47), Iter: 3.16 (7.86), Accuracy: 0.57, loss_seg=1.48, binary_loss=0.49, caption_view=0.20, caption_entity=0.16, loss=2.33, n_captions=11.6(12.5)
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_235148_4273977650> in read-write mode
Traceback (most recent call last):
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 322, in reduce_storage
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/reduction.py", line 194, in DupFd
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files
But if I add
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
I get RuntimeError: CUDA out of memory. Obviously, it is not really because of the CUDA memory: when I reduce the batch size to 4 or 2, I still get this error, just at the 4th or 7th epoch instead.
2023-09-04 17:59:04,135 INFO Epoch [1/128][20/600] LR: 0.004, ETA: 9 days, 1:48:45, Data: 0.13 (0.47), Iter: 6.44 (10.21), Accuracy: 0.56, loss_seg=1.51, binary_loss=0.51, caption_view=0.19, caption_entity=0.18, loss=2.39, n_captions=11.0(12.5)
2023-09-04 18:01:05,994 INFO Epoch [1/128][40/600] LR: 0.004, ETA: 7 days, 5:48:45, Data: 0.10 (0.29), Iter: 6.13 (8.15), Accuracy: 0.61, loss_seg=1.29, binary_loss=0.47, caption_view=0.20, caption_entity=0.18, loss=2.14, n_captions=15.2(12.7)
2023-09-04 18:03:11,054 INFO Epoch [1/128][60/600] LR: 0.004, ETA: 6 days, 16:17:21, Data: 0.14 (0.22), Iter: 5.43 (7.52), Accuracy: 0.63, loss_seg=1.29, binary_loss=0.32, caption_view=0.18, caption_entity=0.15, loss=1.94, n_captions=7.4(12.6)
2023-09-04 18:05:14,054 INFO Epoch [1/128][80/600] LR: 0.004, ETA: 6 days, 8:57:07, Data: 0.12 (0.19), Iter: 6.72 (7.18), Accuracy: 0.63, loss_seg=1.18, binary_loss=0.30, caption_view=0.20, caption_entity=0.17, loss=1.85, n_captions=13.1(12.5)
2023-09-04 18:07:22,827 INFO Epoch [1/128][100/600] LR: 0.004, ETA: 6 days, 5:46:11, Data: 0.10 (0.17), Iter: 6.16 (7.03), Accuracy: 0.54, loss_seg=1.45, binary_loss=0.46, caption_view=0.19, caption_entity=0.16, loss=2.26, n_captions=10.4(12.7)
2023-09-04 18:09:18,085 INFO Epoch [1/128][120/600] LR: 0.004, ETA: 6 days, 1:13:48, Data: 0.09 (0.16), Iter: 6.99 (6.82), Accuracy: 0.71, loss_seg=1.05, binary_loss=0.40, caption_view=0.20, caption_entity=0.18, loss=1.82, n_captions=13.4(12.6)
2023-09-04 18:11:29,505 INFO Epoch [1/128][140/600] LR: 0.004, ETA: 6 days, 0:26:09, Data: 0.08 (0.15), Iter: 5.18 (6.78), Accuracy: 0.67, loss_seg=1.07, binary_loss=0.32, caption_view=0.20, caption_entity=0.18, loss=1.77, n_captions=12.4(12.7)
2023-09-04 18:13:37,711 INFO Epoch [1/128][160/600] LR: 0.004, ETA: 5 days, 23:24:20, Data: 0.14 (0.14), Iter: 7.65 (6.74), Accuracy: 0.69, loss_seg=1.04, binary_loss=0.28, caption_view=0.19, caption_entity=0.14, loss=1.66, n_captions=11.9(12.9)
Traceback (most recent call last):
File "train.py", line 260, in <module>
main()
File "train.py", line 255, in main
arnold=arnold
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 23.70 GiB total capacity; 11.56 GiB already allocated; 1.23 GiB free; 12.34 GiB reserved in total by PyTorch)
Here are my guesses.
I'm wondering if it's caused by repeatedly loading the caption files, for example constantly opening files without closing them, or constantly loading them into memory without freeing them, so that the problem does not occur when the sample count is small but only when it is large.
When the batch size is small, the training can even run through a complete epoch. This suggests that the memory occupied during the first epoch is not completely freed, or the files are not closed, so the second epoch keeps adding to memory or keeps opening files, which ultimately leads to this problem.
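If that guess is right, one generic thing to double-check (a sketch only; I don't know the exact format or helper used for the caption files) is that every caption file is read through a context manager so its descriptor is released immediately:

import json

def load_caption_file(path):
    # 'with' closes the file as soon as the contents are read, even if
    # json.load raises; a bare open() that is never closed would slowly
    # consume the per-process fd limit across the dataloader workers
    with open(path, 'r') as f:
        return json.load(f)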
BTW, when you last encountered this problem, how did you locate it and what exactly caused it? I know it was caused by loading the captions, but could you please give more specific information? (If you can't remember, maybe look it up in the git commit history.)
Looking forward to your reply.
Hi, we encountered this issue while indexing the corresponding caption and resolved it by adding 'copy.deepcopy'. https://github.com/CVMI-Lab/PLA/blob/3d8494b9499c9f7f2b1ed869fd7f57c960042a1a/pcseg/datasets/indoor_dataset.py#L201. Currently, we are unable to reproduce this error, but we will investigate the cause at a later time.
Hi, thank you for your reply. If deepcopy resolved this problem, then I think it didn't take effect in my training: I set a breakpoint at line 201, cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index]), but it never reached it. The code goes into the if isinstance(self.scene_image_corr_infos, dict) branch, not the else branch, and then the error occurs.
if hasattr(self, 'scene_image_corr_infos') and self.scene_image_corr_infos is not None:
    if isinstance(self.scene_image_corr_infos, dict):
        # assert scene_name in self.scene_image_corr_infos
        info = self.scene_image_corr_infos.get(scene_name, {})
    else:
        cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index])
        assert scene_name == cur_caption_idx['scene_name']
        info = cur_caption_idx['infos']
    if len(info) > 0:
        image_name_view, image_corr_view = zip(*info.items())
    else:
        image_name_view, image_corr_view = [], []
    image_name_dict['view'] = image_name_view
    image_corr_dict['view'] = image_corr_view
Hi, in that case, could you please try wrapping self.scene_image_corr_infos.get(scene_name, {}) with copy.deepcopy?
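Something along these lines, i.e. a sketch of the suggested change to that branch in indoor_dataset.py (only the dict branch changes; the else branch stays as it is):

if isinstance(self.scene_image_corr_infos, dict):
    # deep-copy the per-scene info so the returned sample does not keep a
    # reference into the dict shared by all dataloader workers
    info = copy.deepcopy(self.scene_image_corr_infos.get(scene_name, {}))
else:
    cur_caption_idx = copy.deepcopy(self.scene_image_corr_infos[index])
    assert scene_name == cur_caption_idx['scene_name']
    info = cur_caption_idx['infos']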
Hi, thank you for your help. I followed your advice and wrapped self.scene_image_corr_infos.get(scene_name, {}) and self.scene_image_corr_entity_infos.get(scene_name, {}) with copy.deepcopy. Now it's not reporting errors; I'll wait for the training to complete to see if the issue recurs.
Hi, with batch size = 12 and a 24G GPU, running python train.py --cfg_file cfgs/scannet_models/spconv_clip_base15_caption_adamw.yaml
the training stopped at the 32nd epoch with a "CUDA out of memory" error. Are there other places in the code that need to be changed as well?
2023-09-06 08:08:17,913 INFO Epoch [32/128][160/400] LR: 0.004, ETA: 2 days, 10:44:09, Data: 0.07 (0.10), Iter: 3.72 (5.47), Accuracy: 0.89, loss_seg=0.30, binary_loss=0.22, caption_view=0.15, caption_entity=0.13, loss=0.79, n_captions=7.8(13.0)
Traceback (most recent call last):
File "train.py", line 261, in <module>
main()
File "train.py", line 256, in main
arnold=arnold
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 207, in train_model
text_encoder=text_encoder
File "/var/autofs/home/hale/usrs/wangjuan/code/PLA/tools/train_utils/train_utils.py", line 92, in train_one_epoch
scaler.scale(loss).backward()
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usrs/wangjuan/anaconda3/envs/PLA/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.39 GiB (GPU 0; 23.70 GiB total capacity; 19.30 GiB already allocated; 1.97 GiB free; 20.02 GiB reserved in total by PyTorch)
The batch size is set too large. Maybe you can try a batch size of 4 per GPU.
I'm sorry, I'm having trouble understanding a couple of things. You mention that you run all experiments with batch size 32 on 8 NVIDIA V100 or NVIDIA A100. May I ask how long you trained for, and what the GPU utilization was during training?
We use batch size 32 on 8 GPUs to align with our baseline SparseUNet. If you want to change the default setting, you should adjust the LR schedule and the number of epochs yourself to align the results. We haven't checked GPU utilization carefully, but it is around 80-90%. GPU utilization is influenced by the CPU, IO and many other factors, so it varies from device to device.
The GPU allocation varies during training because spconv allocates varying amounts of GPU memory, and the number of captions in each batch can also differ, leading to different GPU memory allocations. As a result, there may be occasional CUDA OOM errors if the batch size is set too large. Currently, we run with a batch size of 4 per GPU on 32GB V100 or 80GB A100 (primarily for instance segmentation in S3DIS). We haven't tested on other GPU machines to verify the GPU limit. The GPU utilization is high (around ~80-90%), and the average memory allocation is around 10GB+. However, I would not recommend setting a large batch size since the GPU memory changes over time.
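Regarding adjusting the LR schedule for a smaller total batch size mentioned above, a common heuristic (not a setting taken from the PLA configs) is the linear scaling rule, sketched here with the default LR of 0.004 seen in the logs:

base_batch_size = 32   # total batch size the default schedule was tuned for
base_lr = 0.004        # default learning rate from the config/logs
my_batch_size = 12     # e.g. a single 24G GPU

# linear scaling rule: scale the LR proportionally to the total batch size
scaled_lr = base_lr * my_batch_size / base_batch_size
print(scaled_lr)       # 0.0015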
Hi~ When I run
python train.py --cfg_file cfgs/scannet_models/spconv_clip_base15_caption_adamw.yaml
I hit an error. I then reduced the num_workers and the batch_size. I guess it is caused by the open-file limit, so I set ulimit -n 10240, but I still got the error. Have you ever had a similar problem? Looking forward to your reply!