V2AI / Det3D

World's first general purpose 3D object detection codebase.
https://arxiv.org/abs/1908.09492
Apache License 2.0

PointPillars died with <Signals.SIGFPE: 8> #125

Closed (Chi-Zaozao closed this issue 3 years ago)

Chi-Zaozao commented 4 years ago

When I train PointPillars on my own dataset, I run into the following problem, and I cannot work out how to fix it.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/workspace/det3d_requirement/Det3D/tools/train.py', '--local_rank=0', '/workspace/det3d_requirement/Research/deep_km500/codes/simit_km500_pillar.py', '--work_dir=/workspace/det3d_requirement/Research/deep_km500/outputs/SIMIT_KM500_PILLAR_lucky_lr3e-4_v444_wd1e-3_neg1_20200723-083639']' died with <Signals.SIGFPE: 8>.

I would really appreciate any suggestions.

poodarchu commented 4 years ago

The log message provides no useful information.

Chi-Zaozao commented 4 years ago

The full log output is as follows:

2020-07-23 08:49:02,306 - INFO - Distributed training: False
2020-07-23 08:49:02,306 - INFO - torch.backends.cudnn.benchmark: False
2020-07-23 08:49:02,306 - INFO - building model...
2020-07-23 08:49:02,355 - INFO - Finish RPN Initialization
2020-07-23 08:49:02,355 - INFO - num_classes: [4], num_preds: [56], num_dirs: [16]
2020-07-23 08:49:02,356 - INFO - Finish MultiGroupHead Initialization
2020-07-23 08:49:02,356 - INFO - model already been built
2020-07-23 08:49:02,356 - INFO - building datasets...
2020-07-23 08:49:02,372 - INFO - {'concealed1': 4, 'concealed2': 4, 'concealed3': 4, 'concealed4': 4}
2020-07-23 08:49:02,373 - INFO - [-1]
2020-07-23 08:49:02,379 - INFO - load 428 concealed1 database infos
2020-07-23 08:49:02,379 - INFO - load 117 concealed2 database infos
2020-07-23 08:49:02,379 - INFO - load 450 concealed3 database infos
2020-07-23 08:49:02,379 - INFO - load 208 concealed4 database infos
2020-07-23 08:49:02,382 - INFO - After filter database:
2020-07-23 08:49:02,382 - INFO - load 420 concealed1 database infos
2020-07-23 08:49:02,382 - INFO - load 116 concealed2 database infos
2020-07-23 08:49:02,382 - INFO - load 447 concealed3 database infos
2020-07-23 08:49:02,382 - INFO - load 208 concealed4 database infos
2020-07-23 08:49:02,382 - INFO - datasets already been built
2020-07-23 08:49:02,387 - INFO - starting train detector...
total_steps: 250000
len(data_loaders[0]): 500
2020-07-23 08:49:03,917 - INFO - model structure: PointPillars(
  (reader): PillarFeatureNet(
    (pfn_layers): ModuleList(
      (0): PFNLayer(
        (linear): Linear(in_features=9, out_features=64, bias=False)
        (norm): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      )
    )
  )
  (backbone): PointPillarsScatter()
  (neck): RPN(
    (blocks): ModuleList(
      (0): Sequential(
        (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
        (1): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), bias=False)
        (2): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (3): ReLU()
        (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (5): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (6): ReLU()
        (7): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (8): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (9): ReLU()
        (10): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (11): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (12): ReLU()
      )
      (1): Sequential(
        (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
        (1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), bias=False)
        (2): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (3): ReLU()
        (4): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (5): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (6): ReLU()
        (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (8): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (9): ReLU()
        (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (11): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (12): ReLU()
        (13): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (14): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (15): ReLU()
        (16): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (17): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (18): ReLU()
      )
      (2): Sequential(
        (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
        (1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), bias=False)
        (2): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (3): ReLU()
        (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (5): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (6): ReLU()
        (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (8): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (9): ReLU()
        (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (11): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (12): ReLU()
        (13): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (14): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (15): ReLU()
        (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (17): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (18): ReLU()
      )
    )
    (deblocks): ModuleList(
      (0): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU()
      )
      (1): Sequential(
        (0): ConvTranspose2d(128, 128, kernel_size=(2, 2), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU()
      )
      (2): Sequential(
        (0): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(4, 4), bias=False)
        (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU()
      )
    )
  )
  (bbox_head): MultiGroupHead(
    (loss_cls): SigmoidFocalLoss()
    (loss_reg): WeightedSmoothL1Loss()
    (loss_aux): WeightedSoftmaxClassificationLoss()
    (tasks): ModuleList(
      (0): Head(
        (conv_box): Conv2d(384, 56, kernel_size=(1, 1), stride=(1, 1))
        (conv_cls): Conv2d(384, 32, kernel_size=(1, 1), stride=(1, 1))
        (conv_dir): Conv2d(384, 16, kernel_size=(1, 1), stride=(1, 1))
      )
    )
  )
)
2020-07-23 08:49:03,918 - INFO - building trainer...
2020-07-23 08:49:03,918 - INFO - trainer already been built
2020-07-23 08:49:03,918 - INFO - trainer registering hooks...
2020-07-23 08:49:03,918 - INFO - hooks already been built
2020-07-23 08:49:03,918 - INFO - start running trainer...
2020-07-23 08:49:03,919 - INFO - Start running, host: root@5ad3f45114dc, work_dir: /workspace/det3d_requirement/Research/deep_km500/outputs/SIMIT_KM500_PILLAR_lucky_lr3e-4_v444_wd1e-3_neg1_20200723-084854
2020-07-23 08:49:03,919 - INFO - workflow: [('train', 1), ('val', 1)], max: 500 epochs
/workspace/det3d_requirement/Det3D/det3d/core/sampler/preprocess.py:464: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, C))
  points[i : i + 1, :3] = points[i : i + 1, :3] @ rot_mat_T[j]
/opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/typing/npydecl.py:958: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, C))
  warnings.warn(NumbaPerformanceWarning(msg))
/workspace/det3d_requirement/Det3D/det3d/core/bbox/geometry.py:387: NumbaWarning: 
Compilation is falling back to object mode WITH looplifting enabled because Function "points_in_convex_polygon_jit" failed type inference due to: Invalid use of Function(<built-in function getitem>) with argument(s) of type(s): (array(float32, 3d, C), Tuple(slice<a:b>, list(int64), slice<a:b>))
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
In definition 2:
    All templates rejected with literals.
In definition 3:
    All templates rejected without literals.
In definition 4:
    All templates rejected with literals.
In definition 5:
    All templates rejected without literals.
In definition 6:
    All templates rejected with literals.
In definition 7:
    All templates rejected without literals.
In definition 8:
    All templates rejected with literals.
In definition 9:
    All templates rejected without literals.
In definition 10:
    All templates rejected with literals.
In definition 11:
    All templates rejected without literals.
In definition 12:
    TypeError: unsupported array index type list(int64) in Tuple(slice<a:b>, list(int64), slice<a:b>)
    raised from /opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/typing/arraydecl.py:71
In definition 13:
    TypeError: unsupported array index type list(int64) in Tuple(slice<a:b>, list(int64), slice<a:b>)
    raised from /opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/typing/arraydecl.py:71
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at /workspace/det3d_requirement/Det3D/det3d/core/bbox/geometry.py (406)

File "../../../Det3D/det3d/core/bbox/geometry.py", line 406:
def points_in_convex_polygon_jit(points, polygon, clockwise=True):
    <source elided>
                :,
                [num_points_of_polygon - 1] + list(range(num_points_of_polygon - 1)),
                ^

  @numba.jit
2020-07-23 08:49:13,659 - INFO - finding looplift candidates
/workspace/det3d_requirement/Det3D/det3d/core/bbox/geometry.py:387: NumbaWarning: 
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "points_in_convex_polygon_jit" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "../../../Det3D/det3d/core/bbox/geometry.py", line 423:
def points_in_convex_polygon_jit(points, polygon, clockwise=True):
    <source elided>
    cross = 0.0
    for i in range(num_points):
    ^

  @numba.jit
/opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/object_mode_passes.py:178: NumbaWarning: Function "points_in_convex_polygon_jit" was compiled in object mode without forceobj=True, but has lifted loops.

File "../../../Det3D/det3d/core/bbox/geometry.py", line 398:
def points_in_convex_polygon_jit(points, polygon, clockwise=True):
    <source elided>
    # first convert polygon to directed lines
    num_points_of_polygon = polygon.shape[1]
    ^

  state.func_ir.loc))
/opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/object_mode_passes.py:188: NumbaDeprecationWarning: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "../../../Det3D/det3d/core/bbox/geometry.py", line 398:
def points_in_convex_polygon_jit(points, polygon, clockwise=True):
    <source elided>
    # first convert polygon to directed lines
    num_points_of_polygon = polygon.shape[1]
    ^

  state.func_ir.loc))
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/workspace/det3d_requirement/Det3D/tools/train.py', '--local_rank=0', '/workspace/det3d_requirement/Research/deep_km500/codes/simit_km500_pillar.py', '--work_dir=/workspace/det3d_requirement/Research/deep_km500/outputs/SIMIT_KM500_PILLAR_lucky_lr3e-4_v444_wd1e-3_neg1_20200723-084854']' died with <Signals.SIGFPE: 8>.

poodarchu commented 4 years ago

You can replace the `@` operation flagged by these warnings:

/workspace/det3d_requirement/Det3D/det3d/core/sampler/preprocess.py:464: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, C))
  points[i : i + 1, :3] = points[i : i + 1, :3] @ rot_mat_T[j]
/opt/conda/lib/python3.6/site-packages/numba-0.48.0-py3.6-linux-x86_64.egg/numba/typing/npydecl.py:958: NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, C))

with standard matrix multiplication.
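
A minimal sketch of what that replacement could look like, written in plain NumPy rather than inside the Numba-compiled function. The helper name `rotate_points` and the sample data are hypothetical, not taken from Det3D. Note that the warning is only a performance hint, so this change is unlikely on its own to explain a SIGFPE.

```python
import numpy as np


def rotate_points(points, rot_mat_T):
    """Rotate the xyz columns of `points` (N, >=3) by `rot_mat_T` (3, 3).

    Hypothetical helper: the strided xyz slice is copied into a contiguous
    array first, then multiplied with np.dot instead of the `@` operator.
    """
    xyz = np.ascontiguousarray(points[:, :3])
    out = points.copy()
    out[:, :3] = np.dot(xyz, rot_mat_T)
    return out


# Illustrative data: two points with an extra intensity column.
angle = np.pi / 2
c, s = np.cos(angle), np.sin(angle)
rot_mat_T = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32)
points = np.array([[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.2]], dtype=np.float32)

rotated = rotate_points(points, rot_mat_T)
```

The non-coordinate columns (here, the fourth one) are left untouched, which mirrors the `[:, :3]` slicing in the flagged line.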

Chi-Zaozao commented 4 years ago

I've commented out @numba.jit, but it didn't help. I found that the error occurs here (det3d/models/readers/pillar_encoder.py, line 47):

    x = self.norm(x.permute(0, 2, 1).contiguous()).permute(0, 2, 1).contiguous()

It works fine on the KITTI dataset.

Chi-Zaozao commented 3 years ago

It seems the error occurs when processing a tensor with 0 elements.
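
One way to surface this earlier is to check for empty arrays before a batch reaches the model, so the failure is a readable Python exception rather than a process killed by a signal. The following is only a sketch: `check_non_empty` is a hypothetical helper, not part of Det3D, and the voxel shape is made up for illustration.

```python
import numpy as np


def check_non_empty(name, array):
    """Raise a readable error if `array` has no elements (hypothetical guard)."""
    array = np.asarray(array)
    if array.size == 0:
        raise ValueError(
            f"{name} is empty (shape {array.shape}); "
            "check the point cloud range and augmentation settings"
        )
    return array


# Illustrative input: a voxel array with zero voxels.
empty_voxels = np.zeros((0, 20, 4), dtype=np.float32)

try:
    check_non_empty("voxels", empty_voxels)
    message = None
except ValueError as err:
    message = str(err)
```

A check like this could sit in the dataset or collate code, where the sample that produced the empty array is still known.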

Chi-Zaozao commented 3 years ago

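Det3D flips the input point cloud randomly, and I hadn't set the right range for that. Presumably the flipped points ended up outside the configured range, which left an empty tensor.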
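
A small sketch of how a random flip combined with a one-sided range can leave no points at all. Everything here is assumed for illustration: the flip axis (y), the `[xmin, ymin, zmin, xmax, ymax, zmax]` range layout, the helper `filter_by_range`, and the sample coordinates are not taken from the actual config in this issue.

```python
import numpy as np


def filter_by_range(points, pc_range):
    """Keep points inside [xmin, ymin, zmin, xmax, ymax, zmax] (hypothetical helper)."""
    lo = np.asarray(pc_range[:3])
    hi = np.asarray(pc_range[3:])
    mask = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    return points[mask]


# Illustrative scene: every point has a positive y coordinate.
points = np.array(
    [[5.0, 1.0, 0.0], [10.0, 5.0, 0.5], [20.0, 9.0, 1.0]], dtype=np.float32
)

one_sided = [0, 0, -3, 40, 10, 3]     # y limited to [0, 10)
symmetric = [0, -10, -3, 40, 10, 3]   # y limited to [-10, 10)

# Assumed flip: mirror the cloud by negating y.
flipped = points.copy()
flipped[:, 1] = -flipped[:, 1]

kept_before = len(filter_by_range(points, one_sided))
kept_after = len(filter_by_range(flipped, one_sided))
kept_symmetric = len(filter_by_range(flipped, symmetric))
```

With the one-sided range, the flipped cloud has no points left inside the range, while a range that is symmetric about the flip axis keeps all of them.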