Pointcept / Pointcept

Pointcept: a codebase for point cloud perception research. Latest works: PTv3 (CVPR'24 Oral), PPT (CVPR'24), OA-CNNs (CVPR'24), MSC (CVPR'23)
MIT License

DataLoader worker killed unexpectedly during testing #138

Open ome-13 opened 9 months ago

ome-13 commented 9 months ago

Hello GitHub community,

I am encountering an issue while running a test on my machine using the following command:

sh scripts/test.sh -p python -d s3dis -n semseg-pt-v2m2-0-base -w model_best -g 1

The error message suggests that the DataLoader worker is being killed unexpectedly, leading to a RuntimeError. Here are the relevant details:


Experiment name: semseg-pt-v2m2-0-base
Python interpreter dir: python
Dataset: s3dis
GPU Num: 1
Loading config in: exp/s3dis/semseg-pt-v2m2-0-base/config.py
Running code in: exp/s3dis/semseg-pt-v2m2-0-base/code
 =========> RUN TASK <=========
[2024-01-28 17:39:19,310 INFO test.py line 41 1845344] => Loading config ...
[2024-01-28 17:39:19,311 INFO test.py line 48 1845344] => Building model ...
[2024-01-28 17:39:19,365 INFO test.py line 61 1845344] Num params: 3908543
[2024-01-28 17:39:21,534 INFO test.py line 68 1845344] Loading weight at: exp/s3dis/semseg-pt-v2m2-0-base/model/model_best.pth
[2024-01-28 17:39:21,917 INFO test.py line 80 1845344] => Loaded weight 'exp/s3dis/semseg-pt-v2m2-0-base/model/model_best.pth' (epoch 70)
[2024-01-28 17:39:21,920 INFO test.py line 53 1845344] => Building test dataset & dataloader ...
[2024-01-28 17:39:21,921 INFO s3dis.py line 55 1845344] Totally 1 x 1 samples in Area_1 set.
[2024-01-28 17:39:21,922 INFO test.py line 119 1845344] >>>>>>>>>>>>>>>> Start Evaluation >>>>>>>>>>>>>>>>
Traceback (most recent call last):
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1845411) is killed by signal: Killed. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tools/test.py", line 38, in <module>
    main()
  File "tools/test.py", line 27, in main
    launch(
  File "/home/ome13/Pointcept/pointcept/engines/launch.py", line 89, in launch
    main_func(*cfg)
  File "tools/test.py", line 20, in main_worker
    tester.test()
  File "/home/ome13/Pointcept/pointcept/engines/test.py", line 160, in test
    for idx, data_dict in enumerate(self.test_loader):
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/home/ome13/miniconda3/envs/pointcept/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1845411) exited unexpectedly

Any suggestions on how to diagnose and resolve this issue would be greatly appreciated. I have checked the dataset, and it seems to be loaded correctly. Are there any known issues with DataLoader on machines with similar specifications?

Thank you for your help!

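Machine Specs:

  • Processor: 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz, 8 cores, 16 logical processors
  • RAM: 32.0 GB
  • GPU: NVIDIA GeForce GTX 1080 Ti
  • CUDA Compiler Version: 11.3
  • GPU Memory: 11264MiB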

Gofinge commented 9 months ago

Hi, I noticed the log line "Totally 1 x 1 samples in Area_1 set." Is that irregular?

ome-13 commented 9 months ago

Yes, this is because I am using only one data sample from the 'Area_1' directory here.

DGCHAO commented 7 months ago

> Machine Specs:
>
>   • Processor: 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz, 8 cores, 16 logical processors
>   • RAM: 32.0 GB
>   • GPU: NVIDIA GeForce GTX 1080 Ti
>   • CUDA Compiler Version: 11.3
>   • GPU Memory: 11264MiB

I also encountered the same problem.

Gofinge commented 7 months ago

So, what's the batch size? If the batch size is even larger than the number of samples, an error may occur.
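
For reference, here is a minimal sketch of the batch-count arithmetic behind this question. It assumes the usual PyTorch DataLoader semantics, where `drop_last` decides whether a short final batch is kept. The helper name `num_batches` is made up for illustration and is not part of Pointcept.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields per pass (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept.
    return math.ceil(num_samples / batch_size)


# One sample, as in "Totally 1 x 1 samples in Area_1 set."
print(num_batches(1, 1))                  # 1
print(num_batches(1, 4))                  # 1 (a single short batch)
print(num_batches(1, 4, drop_last=True))  # 0 (nothing to iterate over)
```

Under these assumptions, a batch size larger than the dataset only leaves the loader empty when `drop_last` is enabled; otherwise it yields one short batch.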

ome-13 commented 7 months ago

I reduced the batch size to one and am still getting that error.

DGCHAO commented 7 months ago

> So, what's the batch size? If the batch size is even larger than the number of samples, an error may occur.

I set batch_size to 1 and num_work to 0.

Gofinge commented 7 months ago

So, just debugging, how about increasing the number of samples?

Novmaple commented 7 months ago

I have the same problem. Memory usage keeps rising, and the program is killed once usage reaches 100%.
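
One way to confirm this kind of growth is to print the resident memory of the process at each step. The sketch below uses only the Python standard library; the helper names `rss_mb` and `log_memory` are hypothetical, and the loop is a stand-in for the real test loop. It measures only the process that calls it, so DataLoader worker processes would need the same call on their side.

```python
import os
import resource
import sys


def rss_mb() -> float:
    """Resident memory of the current process in MiB (best effort)."""
    try:
        # Linux: current resident set size from /proc.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0  # value is in kB
    except OSError:
        pass
    # Fallback: peak RSS. Units are kB on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


def log_memory(tag: str) -> float:
    """Print and return the current memory reading."""
    mb = rss_mb()
    print(f"[mem] {tag}: {mb:.1f} MiB (pid {os.getpid()})")
    return mb


# Stand-in for the test loop: each step deliberately retains about 10 MiB.
leak = []
for step in range(3):
    leak.append(b"x" * (10 * 1024 * 1024))
    log_memory(f"step {step}")
```

If the readings keep climbing from one fragment or scene to the next, something is being retained across iterations.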

Gofinge commented 7 months ago

> I have the same problem. Memory usage keeps rising, and the program is killed once usage reaches 100%.

@DGCHAO @Novmaple Thanks for pointing out the cause. However, the batch size has already been reduced to 1 and the local RAM is 32 GB, which should be sufficient for running Pointcept. What about the size of the input point cloud?

Novmaple commented 7 months ago

My point cloud is small in size.

-20.919402 0.982379 0.328775 231 231 231
-20.917162 0.985604 0.331016 175 175 175
-20.909651 0.983475 0.331335 48 48 48
-20.908737 0.984242 0.331335 10 10 10
-20.911381 0.985221 0.331335 114 114 114
-20.91006 0.986302 0.331335 9 9 9

I find that I can run it successfully on one test set (Area 2, for example), but when I modify the config file to test on another Area, this error occurs. This Area is truly special: its accuracy is much higher than that of the other areas in the test/val set.

Miki-lin commented 4 months ago

> My point cloud is small in size.
>
> -20.919402 0.982379 0.328775 231 231 231
> -20.917162 0.985604 0.331016 175 175 175
> -20.909651 0.983475 0.331335 48 48 48
> -20.908737 0.984242 0.331335 10 10 10
> -20.911381 0.985221 0.331335 114 114 114
> -20.91006 0.986302 0.331335 9 9 9
>
> I find that I can run it successfully on one test set (Area 2, for example), but when I modify the config file to test on another Area, this error occurs. This Area is truly special: its accuracy is much higher than that of the other areas in the test/val set.

I have the same problem. In my case GPU memory usage keeps increasing during testing, and I wonder whether some memory is never released somewhere. If you fix it, please tell me how.

Yatoronto commented 4 months ago

I have the same problem.

vietpho commented 4 months ago

I'm having the same issue. Has anyone found a solution?

Gofinge commented 3 months ago

How about adding some torch.cuda.empty_cache() calls during testing?
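
A minimal sketch of that idea follows. The helper name `release_cached_memory` and its placement in the loop are assumptions, not Pointcept code. Note that `torch.cuda.empty_cache()` only returns unused cached GPU blocks to the driver. It does not free tensors that are still referenced, and it does not touch host RAM, so it may not help if a worker is being killed for exhausting system memory.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def release_cached_memory() -> bool:
    """Collect unreferenced Python objects, then release cached GPU blocks.

    Returns True if torch.cuda.empty_cache() was actually called.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


# Hypothetical placement: once per scene, or every few fragments, in the test loop.
for fragment_idx in range(3):
    # ... the forward pass on one fragment would go here ...
    released = release_cached_memory()
```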

Gofinge commented 3 months ago

Can someone who has this issue share some logs?

blueeaglex commented 2 months ago

@Gofinge
I have the same problem, and I have no more information than these logs:

Experiment name: semseg-pt-v2m2-0-base
Python interpreter dir: python
Dataset: s3dis
GPU Num: 1
Loading config in: exp/s3dis/semseg-pt-v2m2-0-base/config.py
Running code in: exp/s3dis/semseg-pt-v2m2-0-base/code
 =========> RUN TASK <=========
[2024-08-07 15:25:45,524 INFO test.py line 41 10836] => Loading config ...
[2024-08-07 15:25:45,524 INFO test.py line 48 10836] => Building model ...
[2024-08-07 15:25:45,561 INFO test.py line 61 10836] Num params: 3908641
[2024-08-07 15:25:46,313 INFO test.py line 68 10836] Loading weight at: exp/s3dis/semseg-pt-v2m2-0-base/model/model_best.pth
[2024-08-07 15:25:47,572 INFO test.py line 80 10836] => Loaded weight 'exp/s3dis/semseg-pt-v2m2-0-base/model/model_best.pth' (epoch 6)
[2024-08-07 15:25:47,575 INFO test.py line 53 10836] => Building test dataset & dataloader ...
[MYPRINT] data_root content:data/s3dis
[MYPRINT] split:area_3
[MYPRINT] I am here
[MYPRINT] data_list content: ['data/s3dis/area_3/conferenceRoom_1', 'data/s3dis/area_3/hallway_1', 'data/s3dis/area_3/hallway_2', 'data/s3dis/area_3/hallway_3', 'data/s3dis/area_3/hallway_4', 'data/s3dis/area_3/hallway_5', 'data/s3dis/area_3/hallway_6', 'data/s3dis/area_3/lounge_1', 'data/s3dis/area_3/lounge_2', 'data/s3dis/area_3/office_1', 'data/s3dis/area_3/office_10', 'data/s3dis/area_3/office_2', 'data/s3dis/area_3/office_3', 'data/s3dis/area_3/office_4', 'data/s3dis/area_3/office_5', 'data/s3dis/area_3/office_6', 'data/s3dis/area_3/office_7', 'data/s3dis/area_3/office_8', 'data/s3dis/area_3/office_9', 'data/s3dis/area_3/storage_1', 'data/s3dis/area_3/storage_2', 'data/s3dis/area_3/WC_1', 'data/s3dis/area_3/WC_2']
[2024-08-07 15:25:47,582 INFO defaults.py line 70 10836] Totally 23 x 1 samples in area_3 set.
[2024-08-07 15:25:47,583 INFO test.py line 119 10836] >>>>>>>>>>>>>>>> Start Evaluation >>>>>>>>>>>>>>>>
[MYPRINT] data_path: data/s3dis/area_3/conferenceRoom_1
[MYPRINT] data_path: data/s3dis/area_3/hallway_1
/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/pointops-1.0-py3.8-linux-x86_64.egg/pointops/query.py:19: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392020195/work/torch/csrc/tensor/python_tensor.cpp:83.)
  idx = torch.cuda.IntTensor(m, nsample).zero_()
[2024-08-07 15:26:03,333 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 0/516
[2024-08-07 15:26:03,644 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 1/516
[2024-08-07 15:26:03,956 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 2/516
[2024-08-07 15:26:04,263 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 3/516
[2024-08-07 15:26:04,572 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 4/516
[2024-08-07 15:26:04,886 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 5/516
[MYPRINT] data_path: data/s3dis/area_3/hallway_2
[2024-08-07 15:26:05,197 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 6/516
[2024-08-07 15:26:05,509 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 7/516
[2024-08-07 15:26:05,827 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 8/516
[2024-08-07 15:26:06,145 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 9/516
[2024-08-07 15:26:06,478 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 10/516
[2024-08-07 15:26:06,836 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 11/516
[2024-08-07 15:26:07,188 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 12/516
[2024-08-07 15:26:07,802 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 13/516
[2024-08-07 15:26:08,529 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 14/516
[2024-08-07 15:26:09,212 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 15/516
[2024-08-07 15:26:09,943 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 16/516
[2024-08-07 15:26:10,548 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 17/516
[2024-08-07 15:26:10,877 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 18/516
[2024-08-07 15:26:11,188 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 19/516
[2024-08-07 15:26:11,496 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 20/516
[2024-08-07 15:26:11,808 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 21/516
[2024-08-07 15:26:12,117 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 22/516
[2024-08-07 15:26:12,429 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 23/516
[2024-08-07 15:26:12,740 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 24/516
[2024-08-07 15:26:13,052 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 25/516
[2024-08-07 15:26:13,361 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 26/516
[2024-08-07 15:26:13,675 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 27/516
[2024-08-07 15:26:13,985 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 28/516
[2024-08-07 15:26:14,292 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 29/516
[2024-08-07 15:26:14,606 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 30/516
[2024-08-07 15:26:14,923 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 31/516
[2024-08-07 15:26:15,236 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 32/516
[2024-08-07 15:26:15,550 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 33/516
[2024-08-07 15:26:16,204 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 34/516
[2024-08-07 15:26:16,758 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 35/516
[2024-08-07 15:26:17,077 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 36/516
[2024-08-07 15:26:17,396 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 37/516
[2024-08-07 15:26:17,715 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 38/516
[2024-08-07 15:26:18,037 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 39/516
[2024-08-07 15:26:18,355 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 40/516
[2024-08-07 15:26:18,673 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 41/516
[2024-08-07 15:26:18,989 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 42/516
[2024-08-07 15:26:19,306 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 43/516
[2024-08-07 15:26:19,625 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 44/516
[2024-08-07 15:26:19,946 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 45/516
[2024-08-07 15:26:20,264 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 46/516
[2024-08-07 15:26:20,577 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 47/516
[2024-08-07 15:26:20,894 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 48/516
[2024-08-07 15:26:21,206 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 49/516
[2024-08-07 15:26:21,527 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 50/516
[2024-08-07 15:26:21,845 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 51/516
[2024-08-07 15:26:22,165 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 52/516
[2024-08-07 15:26:22,478 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 53/516
[2024-08-07 15:26:22,797 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 54/516
[2024-08-07 15:26:23,114 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 55/516
[2024-08-07 15:26:23,428 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 56/516
[2024-08-07 15:26:23,748 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 57/516
[2024-08-07 15:26:24,071 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 58/516
[2024-08-07 15:26:24,384 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 59/516
[2024-08-07 15:26:24,701 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 60/516
[2024-08-07 15:26:25,114 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 61/516
[2024-08-07 15:26:25,493 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 62/516
[2024-08-07 15:26:25,876 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 63/516
[2024-08-07 15:26:26,248 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 64/516
[2024-08-07 15:26:26,628 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 65/516
[2024-08-07 15:26:27,001 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 66/516
[2024-08-07 15:26:27,374 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 67/516
[2024-08-07 15:26:27,778 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 68/516
[2024-08-07 15:26:28,148 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 69/516
[2024-08-07 15:26:28,523 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 70/516
[2024-08-07 15:26:28,899 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 71/516
[2024-08-07 15:26:29,261 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 72/516
[2024-08-07 15:26:29,634 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 73/516
[2024-08-07 15:26:30,006 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 74/516
[2024-08-07 15:26:30,378 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 75/516
[2024-08-07 15:26:30,759 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 76/516
[2024-08-07 15:26:31,663 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 77/516
[2024-08-07 15:26:32,075 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 78/516
[2024-08-07 15:26:32,493 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 79/516
[2024-08-07 15:26:33,261 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 80/516
[2024-08-07 15:26:33,828 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 81/516
[2024-08-07 15:26:34,203 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 82/516
[2024-08-07 15:26:34,631 INFO test.py line 199 10836] Test: 1/23-area_3-conferenceRoom_1, Batch: 83/516
Traceback (most recent call last):
  File "tools/test.py", line 38, in <module>
    main()
  File "tools/test.py", line 27, in main
    launch(
  File "/mnt/c/workspace/Pointcept/pointcept/engines/launch.py", line 89, in launch
    main_func(*cfg)
  File "tools/test.py", line 20, in main_worker
    tester.test()
  File "/mnt/c/workspace/Pointcept/pointcept/engines/test.py", line 190, in test
    pred_part = self.model(input_dict)["seg_logits"]  # (n, k)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/default.py", line 21, in forward
    seg_logits = self.backbone(input_dict)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/point_transformer_v2/point_transformer_v2m2_base.py", line 563, in forward
    points = self.patch_embed(points)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/point_transformer_v2/point_transformer_v2m2_base.py", line 444, in forward
    return self.blocks([coord, feat, offset])
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/point_transformer_v2/point_transformer_v2m2_base.py", line 225, in forward
    points = block(points, reference_index)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/point_transformer_v2/point_transformer_v2m2_base.py", line 169, in forward
    self.attn(feat, coord, reference_index)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/workspace/Pointcept/pointcept/models/point_transformer_v2/point_transformer_v2m2_base.py", line 118, in forward
    relation_qk = relation_qk + peb
  File "/home/owl/anaconda3/envs/PTv3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 10915) is killed by signal: Killed. 

I tried reducing the batch size and the number of workers, but it didn't help.

Gofinge commented 2 months ago

> same problem and I got no more info

I think it is caused by OOM. I didn't add an efficient testing config for S3DIS, but you can refer to this one: https://github.com/Pointcept/Pointcept/blob/main/configs/nuscenes/semseg-pt-v3m1-0-base.py#L158-L178

Set the grid size of the first grid sampling to half that of the second grid sampling.
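
To illustrate why this can help, here is a rough pure-Python sketch. It does not use Pointcept's GridSample. It assumes that test-time voxelization splits a scene into roughly as many fragments as the most crowded voxel holds points, and the point cloud and grid sizes are made up. Keeping one point per voxel at half the test grid size first means each test voxel covers at most 2 x 2 x 2 fine voxels, so it can hold at most 8 points.

```python
import math
import random
from collections import Counter


def voxel_key(point, grid_size):
    """Integer voxel index of a 3D point for a given grid size."""
    return tuple(math.floor(c / grid_size) for c in point)


def grid_sample_one_per_voxel(points, grid_size):
    """Keep the first point that falls into each voxel."""
    kept = {}
    for p in points:
        kept.setdefault(voxel_key(p, grid_size), p)
    return list(kept.values())


def max_points_per_voxel(points, grid_size):
    """Occupancy of the most crowded voxel."""
    return max(Counter(voxel_key(p, grid_size) for p in points).values())


random.seed(0)
# Synthetic dense cloud (hypothetical data, not S3DIS).
points = [
    (random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 0.2))
    for _ in range(50_000)
]

test_grid = 0.04            # second (test-time) grid sampling, assumed value
first_grid = test_grid / 2  # first grid sampling at half that size

raw_max = max_points_per_voxel(points, test_grid)
presampled = grid_sample_one_per_voxel(points, first_grid)
pre_max = max_points_per_voxel(presampled, test_grid)

print(f"points: {len(points)} -> {len(presampled)}")
print(f"max points per test voxel: {raw_max} -> {pre_max} (bounded by 2**3 = 8)")
```

Under that assumption, the pre-sampling step caps the number of fragments per scene, which in turn limits how much data has to be held in memory during testing.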