PaddlePaddle / Paddle3D

A 3D computer vision development toolkit based on PaddlePaddle. It supports point-cloud object detection, segmentation, and monocular 3D object detection models.
Apache License 2.0

CenterPoint Training Error: RuntimeError: (PreconditionNotMet) The meta data must be valid when call the mutable data function. #466

Open yaobaishen opened 1 month ago

yaobaishen commented 1 month ago

The error is reported from apollo-model-centerpoint, which uses Paddle as its backend. I found a similar issue here, but it does not appear to have the same root cause: https://github.com/PaddlePaddle/Paddle3D/issues/118

Below is my error log:

Traceback (most recent call last):
  File "tools/train.py", line 207, in <module>
    main(args)
  File "tools/train.py", line 202, in main
    trainer.train()
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/apis/trainer.py", line 290, in train
    output = training_step(
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/apis/pipeline.py", line 66, in training_step
    outputs = model(sample)
  File "/home/nsoft/anaconda3/envs/cp_paddle_cu11/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/models/base/base_model.py", line 70, in forward
    return self.train_forward(samples, *args, **kwargs)
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/models/detection/centerpoint/centerpoint.py", line 146, in train_forward
    x = self.extract_feat(data)
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/models/detection/centerpoint/centerpoint.py", line 120, in extract_feat
    voxels, coordinates, num_points_in_voxel = self.voxelizer(
  File "/home/nsoft/anaconda3/envs/cp_paddle_cu11/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/models/voxelizers/voxelize.py", line 75, in forward
    voxels, coors_pad, num_points_per_voxel = self.single_forward(
  File "/home/nsoft/Documents/github_code/apollo-model-centerpoint/paddle3d/models/voxelizers/voxelize.py", line 57, in single_forward
    coors = coors.reshape([1, -1, 3])
  File "/home/nsoft/anaconda3/envs/cp_paddle_cu11/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 3543, in reshape
    out = _C_ops.reshape(x, shape)
RuntimeError: (PreconditionNotMet) The meta data must be valid when call the mutable data function. (at /paddle/paddle/phi/core/dense_tensor.cc:111)

I think the root cause is that hard_voxelize() returns invalid coors, so the reshape() operation fails. But when I look into hard_voxelize(), it actually runs the code below (sorry, I could not find the source code location in the GitHub project):

core.eager._run_custom_op(ctx, "hard_voxelize", True)

I am wondering what "_run_custom_op" does inside hard_voxelize(). Can anyone give some hints? Many thanks.

LielinJiang commented 1 month ago

You can find hard_voxelize here: https://github.com/PaddlePaddle/Paddle3D/blob/develop/paddle3d/ops/voxel/voxelize_op.cc#L183

yaobaishen commented 1 month ago

@LielinJiang thanks! I overlooked this piece of code because I am using the VS Code Python debugger, and it looks like breakpoints added in the .cc file do not take effect. Anyway, I can now dig further to find the root cause of my training failure, maybe by adding more logging to the .cc files.

yaobaishen commented 1 month ago

Please keep this issue open, as I am still investigating why the coors returned by hard_voxelize() are invalid. Thanks.
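
One way to narrow this down without stepping into the C++ op might be to run the same points through a small pure-Python reference and compare its output with what hard_voxelize() returns. The sketch below is only an illustration of hard voxelization in general, not Paddle3D's implementation: the function name hard_voxelize_reference, its parameters, and the (z, y, x) ordering of the voxel indices are all assumptions made for this example. It also makes it easy to see whether a sample ends up with zero valid voxels (for example, an empty point cloud or points entirely outside the range), which may be worth ruling out; that is a guess, not a confirmed cause.

```python
import math


def hard_voxelize_reference(points, voxel_size, pc_range,
                            max_points_per_voxel, max_voxels):
    """Pure-Python reference for hard voxelization (hypothetical helper).

    points: iterable of (x, y, z, ...) tuples.
    voxel_size: (vx, vy, vz).
    pc_range: (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns (voxels, coors, num_points_per_voxel). The coors entries are
    (z, y, x) grid indices; this ordering is an assumption for the sketch.
    """
    # Number of voxels along x, y, z.
    grid = [int(round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]))
            for i in range(3)]
    slot_of = {}  # voxel index -> position in the output lists
    voxels, coors, counts = [], [], []

    for p in points:
        # Skip points with NaN or infinite coordinates.
        if not all(math.isfinite(v) for v in p[:3]):
            continue
        idx = [int(math.floor((p[i] - pc_range[i]) / voxel_size[i]))
               for i in range(3)]
        # Skip points outside the configured range.
        if any(c < 0 or c >= g for c, g in zip(idx, grid)):
            continue
        key = (idx[2], idx[1], idx[0])
        slot = slot_of.get(key)
        if slot is None:
            # "Hard" limit 1: drop points that would open too many voxels.
            if len(voxels) >= max_voxels:
                continue
            slot = len(voxels)
            slot_of[key] = slot
            voxels.append([])
            coors.append(key)
            counts.append(0)
        # "Hard" limit 2: drop points beyond the per-voxel capacity.
        if counts[slot] < max_points_per_voxel:
            voxels[slot].append(tuple(p))
            counts[slot] += 1

    return voxels, coors, counts
```

Feeding it the raw points of the failing sample (converted to a list of tuples) and the voxel size, range, and limits from the training config should show how many voxels a correct run would produce; if the reference returns empty lists, the problem is more likely in the input data than in the op itself.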