OCRTOC / OCRTOC_software_package

68 stars 30 forks source link

CUDA error in pointnet2 #8

Closed elevenjiang1 closed 3 years ago

elevenjiang1 commented 3 years ago

I can not run the baseline test for CUDA error in pointnet2;I guess maybe because I run on GTX1060 but driver is too high?

Error happen in roslaunch ocrtoc_task solution.launch After robot arm move around the desk,the error happen,in Graspnet infer forward.I also test groupedxyz = torch.matmul(groupedxyz, rot) singly,and it is good

import torch

A=torch.randn((1,1024,64,3))
B=torch.randn((1,1024,3,4))

out=torch.matmul(A,B)
print("out shape:")
print(out.shape)

Error code is: [INFO] [1633087308.864957, 102.472000]: Get a joint space goal: [INFO] [1633087308.865770, 102.472000]: joint_goal: [1.4834139747315818, -0.22268014487058366, -0.1985030742800026, -2.412410026148746, -0.6452087935672866, 1.5152949988444642, 0.08003010758840406] [ INFO] [1633087308.875540450, 102.496000000]: Combined planning and execution request received for MoveGroup action. Forwarding to planning and execution pipeline. [ INFO] [1633087308.875615357, 102.496000000]: Planning attempt 1 of at most 1 [ INFO] [1633087308.875970141, 102.500000000]: Planner configuration 'panda_arm' will use planner 'geometric::RRTConnect'. Additional configuration parameters will be set when the planner is constructed. [ INFO] [1633087308.876096994, 102.500000000]: RRTConnect: Starting planning with 1 states already in datastructure [ INFO] [1633087308.887386082, 102.504000000]: RRTConnect: Created 5 states (2 start + 3 goal) [ INFO] [1633087308.887415557, 102.504000000]: Solution found in 0.011376 seconds [ INFO] [1633087308.892055194, 102.528000000]: SimpleSetup: Path simplification took 0.004602 seconds and changed from 4 to 2 states [ INFO] [1633087316.757148009, 110.360000000]: Controller position_joint_trajectory_controller successfully finished [ INFO] [1633087316.814564999, 110.452000000]: Completed trajectory execution with status SUCCEEDED ... [ INFO] [1633087316.824338807, 110.464000000]: Received event 'stop'

[INFO] [1633087316.826438, 110.464000]: Finish joint space goal:[ 1.48341397 -0.22268014 -0.19850307 -2.41241003 -0.64520879 1.515295 0.08003011] [ERROR] [1633087333.753505, 127.272000]: Error processing request: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches) ['Traceback (most recent call last):\n', ' File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 632, in _handle_request\n response = convert_return_to_response(self.handler(request), self.response_class)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/scripts/perception_server.py", line 43, in perception_target_handler\n object_list = request.target_object_list,\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/perceptor.py", line 471, in get_response\n pose_method = self.config[\'pose_method\']\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/perceptor.py", line 439, in percept\n gg = self.compute_grasp_pose(full_pcd)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/perceptor.py", line 300, in compute_grasp_pose\n gg = self.graspnet_baseline.inference(grasp_pcd)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/graspnet_baseline.py", line 91, in inference\n gg = self.get_grasps(end_points)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/graspnet_baseline.py", line 80, in get_grasps\n end_points = self.net(end_points)\n', ' File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl\n result = self.forward(*input, kwargs)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/models/graspnet.py", line 68, in forward\n end_points = self.grasp_generator(end_points)\n', ' File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl\n result = self.forward(*input, *kwargs)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/models/graspnet.py", line 51, in forward\n vp_features = self.crop(seed_xyz, pointcloud, grasp_top_views_rot)\n', ' File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl\n result = self.forward(input, kwargs)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/models/modules.py", line 129, in forward\n pointcloud, seed_xyz, vp_rot\n', ' File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl\n result = self.forward(*input, *kwargs)\n', ' File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/pointnet2/pointnet2_utils.py", line 529, in forward\n groupedxyz = torch.matmul(groupedxyz, rot)\n', 'RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)\n'] [ERROR] [1633087333.755645, 127.272000]: Exception in your execute callback: service [/perception_action_target] responded with an error: error processing request: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches) Traceback (most recent call last): File "/opt/ros/melodic/lib/python2.7/dist-packages/actionlib/simple_action_server.py", line 289, in executeLoop self.execute_callback(goal) File "/root/ocrtoc_ws/src/ocrtoc_planning/scripts/solution_entrance.py", line 32, in execute_callback self.planner.cycle_plan_all() File "/root/ocrtoc_ws/src/ocrtoc_planning/scripts/task_planning.py", line 370, in cycle_plan_all self.get_pose_perception(left_object_dic.keys()) # get pose from perception File "/root/ocrtoc_ws/src/ocrtoc_planning/scripts/task_planning.py", line 122, in get_pose_perception perception_result = self._service_get_target_pose(request_msg) File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 442, in call return self.call(args, **kwds) File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 522, in call responses = transport.receive_once() File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_base.py", line 735, in receive_once p.read_messages(b, msg_queue, sock) File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 360, in read_messages self._read_ok_byte(b, sock) File "/opt/ros/melodic/lib/python2.7/dist-packages/rospy/impl/tcpros_service.py", line 343, in _read_ok_byte raise ServiceException("service [%s] responded with an error: %s"%(self.resolved_name, str)) ServiceException: service [/perception_action_target] responded with an error: error processing request: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Screenshot from 2021-10-01 19-29-30

elevenjiang1 commented 3 years ago

After upgrade torch from 1.8 to 1.9,This problem solved~

DCNSW commented 3 years ago

I have the same issue, but after upgrade torch from 1.8 to 1.9 by using the command pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

I get another error about import of pointnet in advance: File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/models/backbone.py", line 14, in <module> from ..pointnet2.pointnet2_modules import PointnetSAModuleVotes, PointnetFPModule File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/pointnet2/pointnet2_modules.py", line 21, in <module> import pointnet2_utils File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/pointnet2/pointnet2_utils.py", line 30, in <module> "Could not import _ext module.\n" ImportError: Could not import _ext module. Please see the setup instructions in the README: https://github.com/erikwijmans/Pointnet2_PyTorch/blob/master/README.rst

Run on GTX1080, have you got the same error? @elevenjiang1 @GouMinghao

elevenjiang1 commented 3 years ago

Yes I have. Pointnet2 may need to rebuild it,which command you did by:

cd /root/ocrtoc_ws/src/tools
bash setup_env.sh

However,owing that you did it before,so may need to remove the build file; My solution is download a new ocrtoc_perception file from official code,then run the command above to rebuild pointnet2.

DCNSW commented 3 years ago

Thank you, it helps a lot.

QAbot-zh commented 3 years ago

@elevenjiang1 @DCNSW @GouMinghao Hello, I meet the same problem. I upgrade torch from 1.8 to 1.9 by using the command

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

And I download a new ocrtoc_perception file from official code to replace old file, rebuild it, but I still get error info as below:

> ROS_MASTER_URI=http://localhost:11311
>
> process[perception_entrance-1]: started with pid [28422]
> process[solution_entrance-2]: started with pid [28426]
> process[controller_spawner-3]: started with pid [28453]
> [INFO] [1634914148.877098]: Controller Spawner: Waiting for service controller_manager/load_controller
> process[move_group-4]: started with pid [28480]
> [INFO] [1634914149.049253]: solution_server is running.
> [ WARN] [1634914149.055690254]: Link 'camera_base' is not known to URDF. Cannot disable collisons.
> [ WARN] [1634914149.058060577]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ WARN] [1634914149.058085914]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ INFO] [1634914149.058188728]: Loading robot model 'panda'...
> process[rviz_zxy_MS_7B23_28374_3466385779312294322-5]: started with pid [28521]
> QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
> [ INFO] [1634914149.354336972]: rviz version 1.13.18
> [ INFO] [1634914149.354369446]: compiled against Qt version 5.9.5
> [ INFO] [1634914149.354378351]: compiled against OGRE version 1.9.0 (Ghadamon)
> [ INFO] [1634914149.380034079]: Forcing OpenGl version 0.
> [ INFO] [1634914149.574102093]: Stereo is NOT SUPPORTED
> [ INFO] [1634914149.574183307]: OpenGL device: llvmpipe (LLVM 10.0.0, 256 bits)
> [ INFO] [1634914149.574213358]: OpenGl version: 3.1 (GLSL 1.4).
> process[manipulator_interface_node-6]: started with pid [28578]
> [ INFO] [1634914150.123325718]: Publishing maintained planning scene on 'monitored_planning_scene'
> [ INFO] [1634914150.125726747]: MoveGroup debug mode is OFF
> Starting planning scene monitors...
> [ INFO] [1634914150.125748542]: Starting planning scene monitor
> [ INFO] [1634914150.127955969]: Listening to '/planning_scene'
> [ INFO] [1634914150.127974952]: Starting world geometry update monitor for collision objects, attached objects, octomap updates.
> [ INFO] [1634914150.130566322]: Listening to '/collision_object'
> [ INFO] [1634914150.137528238]: Listening to '/planning_scene_world' for planning scene world geometry
> [INFO] [1634914150.360102]: ['/root/ocrtoc_ws/src/ocrtoc_common/scripts/manipulator_interface_node.py', '__name:=manipulator_interface_node', '__log:=/root/.ros/log/127058ee-3347-11ec-8310-200db023693e/manipulator_interface_node-6.log']
> [ WARN] [1634914150.368705371]: Link 'camera_base' is not known to URDF. Cannot disable collisons.
> WARNING - 2021-10-22 22:49:10,371 - rigid_transformations - autolab_core not installed as catkin package, RigidTransform ros methods will be unavailable
> [ WARN] [1634914150.373406319]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ WARN] [1634914150.373426284]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ INFO] [1634914150.373501343]: Loading robot model 'panda'...
> Traceback (most recent call last):
>   File "/root/ocrtoc_ws/src/ocrtoc_perception/scripts/perception_entrance.py", line 4, in <module>
>     from perception_server import PerceptionServer
>   File "/root/ocrtoc_ws/src/ocrtoc_perception/scripts/perception_server.py", line 12, in <module>
>     from ocrtoc_perception import Perceptor
>   File "/root/ocrtoc_ws/devel/lib/python2.7/dist-packages/ocrtoc_perception/__init__.py", line 34, in <module>
>     exec(__fh.read())
>   File "<string>", line 2, in <module>
>   File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/__init__.py", line 1, in <module>
>     from .graspnet_baseline import GraspNetBaseLine
>   File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/graspnet_baseline.py", line 18, in <module>
>     from .dataset.graspnet_dataset import GraspNetDataset
>   File "/root/ocrtoc_ws/src/ocrtoc_perception/src/ocrtoc_perception/graspnet/dataset/graspnet_dataset.py", line 12, in <module>
>     from torch._six import container_abcs
> ImportError: cannot import name 'container_abcs'
> INFO - 2021-10-22 22:49:10,462 - core - signal_shutdown [atexit]
> [ INFO] [1634914150.721425295]: waitForService: Service [/get_planning_scene] has not been advertised, waiting...
> [perception_entrance-1] process has died [pid 28422, exit code 1, cmd /root/ocrtoc_ws/src/ocrtoc_perception/scripts/perception_entrance.py __name:=perception_entrance __log:=/root/.ros/log/127058ee-3347-11ec-8310-200db023693e/perception_entrance-1.log].
> log file: /root/.ros/log/127058ee-3347-11ec-8310-200db023693e/perception_entrance-1*.log
> [ INFO] [1634914152.466965777]: Listening to '/camera/depth_registered/points' using message filter with target frame 'world '
> [ INFO] [1634914152.471998827]: Listening to '/attached_collision_object' for attached collision objects
> Planning scene monitors started.
> [ INFO] [1634914152.488824763]: Initializing OMPL interface using ROS parameters
> [ INFO] [1634914152.539834919]: Using planning interface 'OMPL'
> [ERROR] [1634914152.541357595]: Exception while loading planning adapter plugin 'default_planner_request_adapters/ResolveConstraintFrames': According to the loaded plugin descriptions the class default_planner_request_adapters/ResolveConstraintFrames with base class type planning_request_adapter::PlanningRequestAdapter does not exist. Declared types are  default_planner_request_adapters/AddIterativeSplineParameterization default_planner_request_adapters/AddTimeOptimalParameterization default_planner_request_adapters/AddTimeParameterization default_planner_request_adapters/Empty default_planner_request_adapters/FixStartStateBounds default_planner_request_adapters/FixStartStateCollision default_planner_request_adapters/FixStartStatePathConstraints default_planner_request_adapters/FixWorkspaceBounds
> [ INFO] [1634914152.541742070]: Param 'default_workspace_bounds' was not set. Using default value: 10
> [ INFO] [1634914152.542098029]: Param 'start_state_max_bounds_error' was set to 0.1
> [ INFO] [1634914152.542436418]: Param 'start_state_max_dt' was not set. Using default value: 0.5
> [ INFO] [1634914152.542836010]: Param 'start_state_max_dt' was not set. Using default value: 0.5
> [ INFO] [1634914152.543207541]: Param 'jiggle_fraction' was set to 0.05
> [ INFO] [1634914152.543546756]: Param 'max_sampling_attempts' was not set. Using default value: 100
> [ INFO] [1634914152.543575334]: Using planning request adapter 'Add Time Parameterization'
> [ INFO] [1634914152.543588495]: Using planning request adapter 'Fix Workspace Bounds'
> [ INFO] [1634914152.543603087]: Using planning request adapter 'Fix Start State Bounds'
> [ INFO] [1634914152.543619153]: Using planning request adapter 'Fix Start State In Collision'
> [ INFO] [1634914152.543630885]: Using planning request adapter 'Fix Start State Path Constraints'
> [ERROR] [1634914152.853865527]: PluginlibFactory: The plugin for class 'rviz_visual_tools/KeyTool' failed to load.  Error: According to the loaded plugin descriptions the class rviz_visual_tools/KeyTool with base class type rviz::Tool does not exist. Declared types are  rviz/FocusCamera rviz/Interact rviz/Measure rviz/MoveCamera rviz/PublishPoint rviz/Select rviz/SetGoal rviz/SetInitialPose rviz_plugin_tutorials/PlantFlag
> [rospack] Error: no package given
> [librospack]: error while executing command
> [rospack] Error: no package given
> [librospack]: error while executing command
> [ERROR] [1634914152.862015123]: PluginlibFactory: The plugin for class 'rviz_visual_tools/RvizVisualToolsGui' failed to load.  Error: According to the loaded plugin descriptions the class rviz_visual_tools/RvizVisualToolsGui with base class type rviz::Panel does not exist. Declared types are  rviz_plugin_tutorials/Teleop
> [rospack] Error: no package given
> [librospack]: error while executing command
> [rospack] Error: no package given
> [librospack]: error while executing command
> [ WARN] [1634914152.945786218]: Link 'camera_base' is not known to URDF. Cannot disable collisons.
> [ WARN] [1634914152.945817555]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ WARN] [1634914152.945830469]: Link 'usb_plug_link' is not known to URDF. Cannot disable collisons.
> [ INFO] [1634914152.945955561]: Loading robot model 'panda'...
> [ INFO] [1634914153.490625297]: Added FollowJointTrajectory controller for position_joint_trajectory_controller
> [ INFO] [1634914153.809836654]: Added GripperCommand controller for franka_gripper
> [ INFO] [1634914153.809903436]: Returned 2 controllers in list
> [ INFO] [1634914153.889127206]: Trajectory execution is managing controllers
> Loading 'move_group/ApplyPlanningSceneService'...
> Loading 'move_group/ClearOctomapService'...
> Loading 'move_group/MoveGroupCartesianPathService'...
> Loading 'move_group/MoveGroupExecuteTrajectoryAction'...
> [ INFO] [1634914153.919477499]: Starting planning scene monitor
> [ INFO] [1634914153.923402954]: Listening to '/move_group/monitored_planning_scene'
> Loading 'move_group/MoveGroupGetPlanningSceneService'...
> Loading 'move_group/MoveGroupKinematicsService'...
> Loading 'move_group/MoveGroupMoveAction'...
> Loading 'move_group/MoveGroupPickPlaceAction'...
> [ INFO] [1634914153.958224663]: waitForService: Service [/get_planning_scene] is now available.
> Loading 'move_group/MoveGroupPlanService'...
> Loading 'move_group/MoveGroupQueryPlannersService'...
> Loading 'move_group/MoveGroupStateValidationService'...
> [ INFO] [1634914154.011149650]: 
>
> ********************************************************
> * MoveGroup using: 
> *     - ApplyPlanningSceneService
> *     - ClearOctomapService
> *     - CartesianPathService
> *     - ExecuteTrajectoryAction
> *     - GetPlanningSceneService
> *     - KinematicsService
> *     - MoveAction
> *     - PickPlaceAction
> *     - MotionPlanService
> *     - QueryPlannersService
> *     - StateValidationService
> ********************************************************
>
> [ INFO] [1634914154.011199333]: MoveGroup context using planning plugin ompl_interface/OMPLPlanner
> [ INFO] [1634914154.011217094]: MoveGroup context initialization complete
>
> You can start planning now!

Could you give me some advice? Run on GTX1060 & docker .