NVIDIA-AI-IOT / CUDA-PointPillars

A project demonstrating how to use CUDA-PointPillars to process point cloud data from lidar.
Apache License 2.0

PostProcessCuda is very slow using my model #71

Open GuillaumeAnoufa opened 1 year ago

GuillaumeAnoufa commented 1 year ago

System: Ubuntu 20.04, latest version of OpenPCDet

    GPU has cuda devices: 1
    ----device id: 0 info----
      GPU : NVIDIA GeForce RTX 2080 with Max-Q Design
      Capability: 7.5
      Global memory: 7982MB
      Const memory: 64KB
      SM in a block: 48KB
      warp size: 32
      threads in a block: 1024
      block dim: (1024,1024,64)
      grid dim: (2147483647,65535,65535)

Hello,

I exported my PointPillar weights trained on custom data. The only parameter change compared to the example model is that it uses 1 class instead of 3. I had to change a few things in tools/simplifier_onnx.py for the exporter to work with a number of classes other than 3:

Code changes to work with 1 class: I changed the signature of simplify_postprocess(onnx_model) to simplify_postprocess(onnx_model, num_classes) and changed 3 other lines:

-  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 248, 216, 18))
-  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 248, 216, 42))
-  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 248, 216, 12))
+  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 248, 216, 2 * num_classes * num_classes))
+  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 248, 216, 14 * num_classes))
+  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 248, 216, 4 * num_classes))
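
These head widths follow from the anchor layout. Here is a minimal sketch of the arithmetic (an illustration, not repo code; it assumes the default AnchorHeadSingle setup of one anchor size per class, two anchor rotations, a 7-value box code, and 2 direction bins):

    # Sketch: where the last dimension of each head output comes from,
    # assuming 2 rotations per class, box code size 7, 2 direction bins.
    def head_channels(num_classes, rotations=2, box_code_size=7, num_dir_bins=2):
        num_anchors_per_loc = rotations * num_classes
        return {
            "cls_preds": num_anchors_per_loc * num_classes,       # 2 * nc * nc
            "box_preds": num_anchors_per_loc * box_code_size,     # 14 * nc
            "dir_cls_preds": num_anchors_per_loc * num_dir_bins,  # 4 * nc
        }

    print(head_channels(3))  # {'cls_preds': 18, 'box_preds': 42, 'dir_cls_preds': 12}
    print(head_channels(1))  # {'cls_preds': 2, 'box_preds': 14, 'dir_cls_preds': 4}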

The exporter works, but when testing the demo with this model:

    ---- RUN TIME ----
    load file: ../data/data_velo/000001.bin
    find points num: 18630
    find pillar_num: 6815
    TIME: generateVoxels: 0.038048 ms.
    TIME: generateFeatures: 0.053024 ms.
    TIME: doinfer: 30.2525 ms.
    TIME: doPostprocessCuda: 57528.1 ms.
    TIME: pointpillar: 57558.6 ms.
    Bndbox objs: 4158
    Saved prediction in: ../eval/kitti/object/pred_velo/000001.txt

This model works perfectly fine in PyTorch.

As you can see, the post-processing step takes a very long time and outputs thousands of bounding boxes. Issue #43 describes a similar problem that was seemingly solved by an update, but I am already using the most recent version of this repo.

Do you have an idea what could cause this issue?

I can upload my .pth file or my ONNX file if you want to try to reproduce this.

Best regards,

mazm0002 commented 1 year ago

Have you found a solution to this? I'm having a similar issue using my own model with one class, except it just gets stuck on inference (after the find pillar_num line). I've also noticed one of the cores on the Xavier is being maxed out while this is happening.

GuillaumeAnoufa commented 1 year ago

> Have you found a solution to this? I'm having a similar issue using my own model with one class, except it just gets stuck on inference (after the find pillar_num line). I've also noticed one of the cores on the Xavier is being maxed out while this is happening.

Unfortunately I have no solution yet :(. If you find any lead, please tell me about it! The problem happens both on my PC (Nvidia RTX 2080) and my Xavier NX.

GuillaumeAnoufa commented 1 year ago

Hello, can someone help with this matter, please?

rjwb1 commented 1 year ago

@GuillaumeAnoufa I am experiencing the same issue. I suspect it is related to this line, as changing these values will still seemingly build the model without errors, even when they are wrong.

https://github.com/NVIDIA-AI-IOT/CUDA-PointPillars/blob/092affc36c72d7b8f7530685d4c0f538d987a94b/tool/simplifier_onnx.py#L29

I am also using a single-class detector, but with a different point cloud range and voxel size. I am going to train the model with 3 classes to verify whether this is an issue with the number of classes or with the point cloud range.

rjwb1 commented 1 year ago

@byte-deve Hi, do you know what each of these numbers, '496' and '432', is a product of?

rjwb1 commented 1 year ago

Hi, I realised these are the dimensions of the feature grid.
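
For anyone else who lands here: the feature grid is just the voxel grid, derived from POINT_CLOUD_RANGE and VOXEL_SIZE. A quick sketch of the arithmetic (an illustration, not repo code; the first call uses the default KITTI config, the second the single-class config posted below):

    # Sketch: BEV feature grid size from the dataset config.
    def grid_size(pc_range, voxel_size):
        xmin, ymin, _, xmax, ymax, _ = pc_range
        nx = round((xmax - xmin) / voxel_size[0])
        ny = round((ymax - ymin) / voxel_size[1])
        return ny, nx  # dense_shape is given as (ny, nx)

    print(grid_size([0, -39.68, -3, 69.12, 39.68, 1], [0.16, 0.16, 4]))  # (496, 432)
    print(grid_size([0, -30.72, -3, 40.96, 30.72, 1], [0.16, 0.16, 4]))  # (384, 256)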

rjwb1 commented 1 year ago

@GuillaumeAnoufa I now have this working with a fully custom model; if you still need support you can @ me :)

mazm0002 commented 1 year ago

@rjwb1 Hey, I'm having the same issues setting up a custom model and would really appreciate some guidance :) This is my model and dataset config for reference:

################## MODEL CONFIG #####################

DATA_CONFIG:
    _BASE_CONFIG_: cfgs/dataset_configs/mydata_dataset_only_cone.yaml
    POINT_CLOUD_RANGE: [0, -30.72, -3, 40.96, 30.72, 1]
    DATA_PROCESSOR:

MODEL:
    NAME: PointPillar

VFE:
    NAME: PillarVFE
    WITH_DISTANCE: False
    USE_ABSLOTE_XYZ: True
    USE_NORM: True
    NUM_FILTERS: [64]

MAP_TO_BEV:
    NAME: PointPillarScatter
    NUM_BEV_FEATURES: 64

BACKBONE_2D:
    NAME: BaseBEVBackbone
    LAYER_NUMS: [3, 5, 5]
    LAYER_STRIDES: [2, 2, 2]
    NUM_FILTERS: [64, 128, 256]
    UPSAMPLE_STRIDES: [1, 2, 4]
    NUM_UPSAMPLE_FILTERS: [128, 128, 128]

DENSE_HEAD:
    NAME: AnchorHeadSingle
    CLASS_AGNOSTIC: False

    USE_DIRECTION_CLASSIFIER: True
    DIR_OFFSET: 0.78539
    DIR_LIMIT_OFFSET: 0.0
    NUM_DIR_BINS: 2

    ANCHOR_GENERATOR_CONFIG: [
        {
          'class_name': 'Cone',
          'anchor_sizes': [ [ 0.3, 0.3, 0.6 ] ],
          'anchor_rotations': [ 0, 1.57 ],
          'anchor_bottom_heights': [ -0.7 ],
          'align_center': False,
          'feature_map_stride': 2,
          'matched_threshold': 0.6,
          'unmatched_threshold': 0.4
        }
    ]

    TARGET_ASSIGNER_CONFIG:
        NAME: AxisAlignedTargetAssigner
        POS_FRACTION: -1.0
        SAMPLE_SIZE: 512
        NORM_BY_NUM_EXAMPLES: False
        MATCH_HEIGHT: False
        BOX_CODER: ResidualCoder

    LOSS_CONFIG:
        LOSS_WEIGHTS: {
            'cls_weight': 1.0,
            'loc_weight': 2.0,
            'dir_weight': 0.2,
            'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
        }

POST_PROCESSING:
    RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
    SCORE_THRESH: 0.3
    OUTPUT_RAW_SCORE: False

    EVAL_METRIC: kitti

    NMS_CONFIG:
        MULTI_CLASSES_NMS: False
        NMS_TYPE: nms_gpu
        NMS_THRESH: 0.01
        NMS_PRE_MAXSIZE: 300
        NMS_POST_MAXSIZE: 100

OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 3
    NUM_EPOCHS: 80

    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9

    MOMS: [0.95, 0.85]
    PCT_START: 0.4
    DIV_FACTOR: 10
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001

    LR_WARMUP: False
    WARMUP_EPOCH: 1

########################## DATASET CONFIG ########################

FILTER_MIN_POINTS_IN_GT: 1
POINT_CLOUD_RANGE: [0, -30.72, -3, 40.96, 30.72, 1]  # xmin, ymin, zmin, xmax, ymax, zmax

DATA_SPLIT: { 'train': train, 'test': val }

INFO_PATH: { 'train': [mydata_infos_train.pkl], 'test': [mydata_infos_val.pkl], }

TRAINING_CATEGORIES: { 'Cone': 'Cone', }

FOV_POINTS_ONLY: False

DATA_AUGMENTOR:
    DISABLE_AUG_LIST: ['placeholder', 'gt_sampling']
    AUG_CONFIG_LIST:

POINT_FEATURE_ENCODING: {
    encoding_type: absolute_coordinates_encoding,
    used_feature_list: ['x', 'y', 'z', 'intensity'],
    src_feature_list: ['x', 'y', 'z', 'intensity', 'timestamp'],
}

DATA_PROCESSOR:

rjwb1 commented 1 year ago

@mazm0002 hi there, does the model train successfully and work in PyTorch? What stage of the process are you having trouble with?

mazm0002 commented 1 year ago

@rjwb1 Yeah, so I can train successfully and get the outputs I expect. Then I use the ONNX exporter tool to convert the model and run it with the demo, feeding it custom test data (which works fine in PyTorch). The TensorRT engine generates fine, but the detections take a long time to process and there are way too many bounding boxes, most of them incorrect. I think the issue is probably in the ONNX conversion; could you let me know what you had to change in the tool to get it working for 1 class and a custom data/model config? Thanks a lot for the help!

rjwb1 commented 1 year ago

@mazm0002 Hi, I experienced this too, and it was due to some hard-coded parameters inside the exporter. I also used this useful tool to inspect my generated ONNX file, to make sure it was similar to the default one:

https://netron.app/

https://github.com/lutzroeder/netron

Can you show me what values you have here, or are they the defaults?

https://github.com/NVIDIA-AI-IOT/CUDA-PointPillars/blob/092affc36c72d7b8f7530685d4c0f538d987a94b/tool/simplifier_onnx.py#L29-L45

rjwb1 commented 1 year ago

I think maybe with your model it should look like this?

    op_attrs["dense_shape"] = np.array([384, 256])

    return self.layer(name="PPScatter_0", op="PPScatterPlugin", inputs=inputs, outputs=outputs, attrs=op_attrs)

    def loop_node(graph, current_node, loop_time=0):
        for i in range(loop_time):
            next_node = [node for node in graph.nodes if len(node.inputs) != 0 and len(current_node.outputs) != 0 and node.inputs[0] == current_node.outputs[0]][0]
            current_node = next_node
        return next_node

    def simplify_postprocess(onnx_model):
        print("Use onnx_graphsurgeon to adjust postprocessing part in the onnx...")
        graph = gs.import_onnx(onnx_model)

        cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 192, 128, 2))
        box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 192, 128, 18))
        dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 192, 128, 4))

rjwb1 commented 1 year ago

The size of the scatter plugin's dense_shape array should be equal to the dimensions of the voxel grid.
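
Concretely, the head outputs are that grid divided by the feature_map_stride from the anchor config. A sketch using the numbers above (an illustration, not repo code):

    # Sketch: head output spatial size = voxel grid / feature_map_stride.
    ny, nx = 384, 256   # dense_shape for the point cloud range posted above
    stride = 2          # ANCHOR_GENERATOR_CONFIG feature_map_stride
    print(ny // stride, nx // stride)  # 192 128 -> head shapes (1, 192, 128, C)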

rjwb1 commented 1 year ago

I will open a PR to parameterise these values properly :)

rjwb1 commented 1 year ago

@mazm0002 can you try exporting with the changes I have made in #77

rjwb1 commented 1 year ago

As you are using additional point cloud attributes (5 instead of 4), this may require further parameter changes.
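
For context, the per-point width fed to PillarVFE grows with the raw feature count. A rough sketch of the arithmetic (an illustration, not repo code; it assumes USE_ABSLOTE_XYZ: True and WITH_DISTANCE: False as in the config above):

    # Sketch: per-point feature width for PillarVFE = raw features
    # + 3 offsets to the pillar points' mean + 3 offsets to the pillar center.
    def pillar_feature_dim(num_raw_features, with_distance=False):
        dim = num_raw_features + 6
        if with_distance:
            dim += 1
        return dim

    print(pillar_feature_dim(4))  # 10 -- the width the demo's feature kernel produces, as far as I can tell
    print(pillar_feature_dim(5))  # 11 -- would need matching changes on the CUDA side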

GuillaumeAnoufa commented 1 year ago

Hello @rjwb1, thanks for your input! I added your changes, but unfortunately they didn't seem to solve my problem. The shape of my model did not change using your PR, because I already used the default grid size.

My config has only a few changes from the default config:

    POINT_CLOUD_RANGE: [0, -39.68, -3, 69.12, 39.68, 1] -> POINT_CLOUD_RANGE: [0, -39.68, -1, 69.12, 39.68, 7]
    VOXEL_SIZE: [0.16, 0.16, 4] -> VOXEL_SIZE: [0.16, 0.16, 8]

The biggest change is that I am using a single class instead of 3.

My exported model's shapes seem accurate, but I am still experiencing these very long post-processing times. Below is a picture of the output shapes of the exported model (the rest of the model is exactly the same as the example one):

[image: my_model_output]

rjwb1 commented 1 year ago

Hmmm, I am also using a single class... Would you mind sending a copy of your cfg file, and I will see if I can reproduce this.

GuillaumeAnoufa commented 1 year ago

Sure: pointpillar2.txt. I changed the _BASE_CONFIG_ to the default one. I don't think the _BASE_CONFIG_ matters here, since everything is redefined in the actual config.

rjwb1 commented 1 year ago

@GuillaumeAnoufa Looks almost identical to mine. Strange... I guess I also have my score thresh set to 0.4 and my NMS thresh to 0.1 in my Params.h. Could this reduce post-processing latency?

GuillaumeAnoufa commented 1 year ago

@rjwb1 It doesn't seem to change anything.

I tried exporting the default "pointpillar_7728.pth" model with the default config, just reducing the number of classes from 3 to 1, and I experience the same issue on the default data. Changing the number of classes from 3 to 1 seems to be what triggers the bug in my case.

    load file: ../data/data_velo/000001.bin
    find points num: 18630
    find pillar_num: 6815
    TIME: generateVoxels: 0.03072 ms.
    TIME: generateFeatures: 0.045824 ms.
    TIME: doinfer: 15.7839 ms.
    TIME: doPostprocessCuda: 64484.9 ms.
    TIME: pointpillar: 64500.8 ms.
    Bndbox objs: 4646
    Saved prediction in: ../eval/kitti/object/pred_velo/000001.txt

Changing the number of classes in the config file results in an abnormally high number of predicted bounding boxes.

GuillaumeAnoufa commented 1 year ago

@rjwb1 If you try exporting the default model with this config file (the default one, but with a single class): pointpillar_1class.txt, and infer on the default Velodyne data, do you experience slow post-processing? I know this exported model should not work properly anyway, since the model was trained for 3 classes, but I would like to know whether the issue is reproducible. Thanks a lot for your help :)

GuillaumeAnoufa commented 1 year ago

I forgot to copy the generated param.h and recompile after changing the model... Post-processing time is back to normal, sorry for the inconvenience :sob:

rjwb1 commented 1 year ago

@GuillaumeAnoufa no worries, glad you found the solution 👍🏼

mx2013713828 commented 1 year ago

> @mazm0002 can you try exporting with the changes I have made in #77

Hi, thanks for your work. I changed the files according to your PR #77, and I also copied params.h and recompiled. But inference is still very slow in doPostprocessCuda. My model has 4 classes and tests correctly in OpenPCDet. Could you give me some ideas? I would appreciate it very much!

mx2013713828 commented 1 year ago

Could you tell me how you solved this? I am also hitting this problem: I found that it generates more than 1 million boxes before NMS, so the post-processing is very slow. I changed my code following @rjwb1, but it does not work.
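
One way to sanity-check this is to apply the score threshold to the raw cls_preds outside the demo. A rough NumPy sketch of the pre-NMS score filter (an illustration, not the repo's CUDA kernel): if a stale params.h makes the demo decode scores from the wrong channels, nearly everything clears the threshold and floods the NMS stage.

    import numpy as np

    # Rough sketch of the pre-NMS score filter: a candidate box survives
    # only if its sigmoid score clears SCORE_THRESH.
    # cls_logits: (H, W, A) raw per-anchor logits for a single-class head.
    def count_candidates(cls_logits, score_thresh=0.3):
        scores = 1.0 / (1.0 + np.exp(-cls_logits))
        return int((scores > score_thresh).sum())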

big773 commented 1 year ago

I can export my custom model to ONNX, but the result seems incorrect. Can you give me some advice?

zzt007 commented 1 year ago

@rjwb1 Hello, thanks very much for your guidance. I changed the parameters just like you did, but the problem went from slow post-processing to a CUDA error: illegal memory access. I am also using my own model, which detects only one class, and have also added ROS. I sincerely hope you can tell me how to solve this problem; it has bothered me for a few days.