NVlabs / curobo

CUDA Accelerated Robot Library
https://curobo.org
Other
798 stars 125 forks source link

CUDA graph Error after adding spheres #110

Closed pmj110119 closed 10 months ago

pmj110119 commented 10 months ago

If it’s not a bug, please use discussions: https://github.com/NVlabs/curobo/discussions

Please provide the below information in addition to your issue:

  1. cuRobo installation mode (choose from [python, isaac sim, docker python, docker isaac sim]): docker isaac sim
  2. python version: 3.7
  3. Isaac Sim version (if using): 2022.1.0

Issue Details

Hi guys, I'm using MotionGeneration's plan_single function. When I try to add some spheres, some spheres will cause CUDA Error. I want to know how to avoid this error?

Specifically, I encountered errors in the following two scenarios:

  1. There is a protruding camera on the head of my robotic arm, and I specially set up several sphere balls for it.
  2. I used attach_spheres_to_robot() to add a cube sphere. It seems the error will occur when spheres not fit the robot well?
Traceback (most recent call last):
  File "src/planner.py", line 241, in plan
    result = self.motion_gen.plan_single(joint_state.unsqueeze(0), ik_goal, self.plan_config)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/motion_gen.py", line 1229, in plan_single
    link_poses=link_poses,
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/motion_gen.py", line 1048, in _plan_attempts
    link_poses,
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/motion_gen.py", line 1529, in _plan_from_solve_state
    return_all_solutions=plan_config.parallel_finetune,
  File "/isaac-sim/kit/python/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/motion_gen.py", line 920, in _solve_trajopt_from_solve_state
    newton_iters=newton_iters,
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/trajopt.py", line 498, in solve_any
    newton_iters=newton_iters,
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/trajopt.py", line 620, in solve_single
    newton_iters=newton_iters,
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/reacher/trajopt.py", line 579, in solve_from_solve_state
    result = self.solver.solve(goal_buffer, seed_traj)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/wrap_base.py", line 141, in solve
    act_seq = self.optimize(seed, shift_steps=0)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/wrap/wrap_base.py", line 71, in optimize
    act_seq = opt.optimize(act_seq, shift_steps)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/opt/opt_base.py", line 96, in optimize
    out = self._optimize(opt_tensor, shift_steps, n_iters)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/opt/newton/newton_base.py", line 144, in _optimize
    best_q, best_cost, q, grad_q = self._call_opt_iters_graph(q, grad_q)
  File "/isaac-sim/kit/python/lib/python3.7/site-packages/curobo/opt/newton/newton_base.py", line 513, in _call_opt_iters_graph
    self.cu_opt_graph.replay()
  File "/isaac-sim/extscache/omni.pip.torch-1_13_1-0.1.4+104.2.lx64/torch-1-13-1/torch/cuda/graphs.py", line 87, in replay
    super(CUDAGraph, self).replay()
RuntimeError: false INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32558
pmj110119 commented 10 months ago

It seems that no error will be reported when using robot: franka.yml.

My own robot could use MotionGeneration normally before, but an error was reported after attach_object_to_robot.

I filled in the robot config based on franka.yml, and I entered some keys speculatively. May I ask what might be causing the error?

My own robot config:

# /isaac-sim/kit/python/lib/python3.7/site-packages/curobo/content/configs/robot/jaka-zu5.yml
robot_cfg:
  kinematics:
    usd_path: "robot/ur_description/jaka-zu5.usd"
    usd_robot_root: "/Zu_5"
    isaac_usd_path: "robot/ur_description/jaka-zu5.usd"
    # usd_flip_joints: ["Joint01","Joint04"]
    # usd_flip_joints: {
    #   "Joint01": "Z",
    #   "Joint04": "Z",
    # }

    urdf_path: "robot/ur_description/jaka-zu5.urdf"
    asset_root_path: "robot/ur_description"

    base_link: "Link_00"
    ee_link: "Link_06"
    lock_joints: {
      "Left_1_Joint": -1.0, 
      "Right_1_Joint": -1.0,
      "Left_Support_Joint": 1.0,
      "Right_Support_Joint": 1.0
      }

    extra_links: {
      "attached_object":{
        "parent_link_name": "Link_06",
        "link_name": "attached_object",
        "fixed_transform": [0,0,0,1,0,0,0],
        "joint_type": "FIXED",
        "joint_name": "attach_joint",
      }
    }
    # lock_joints: null
    # collision_link_names: null
    # collision_spheres: null
    # self_collision_ignore: {}
    # self_collision_buffer: {}

    collision_link_names: 
      [
        "Link_00",
        "Link_01",
        "Link_02",
        "Link_03",
        "Link_04",
        "Link_05",
        "Link_06",
        "base_link",
        "Left_1_Link",
        "Left_Pad_Link",
        "Left_Support_Link",
        "Right_1_Link",
        "Right_Support_Link",
        "Right_Pad_Link",
        # "link_camera",
        # "link_cube",
        "attached_object",
      ]
    collision_spheres:  "jaka-zu5-lula.yaml"

    collision_sphere_buffer: 0.0025
    extra_collision_spheres: {"attached_object": 4}

    self_collision_ignore:
      {
        "Link_00": ["Link_01", "Link_02"],
        "Link_01": ["Link_02", "Link_03", "Link_04"],
        "Link_02": ["Link_03", "Link_04"],
        "Link_03": ["Link_04", "Link_05"],
        "Link_04": ["Link_03", "Link_05", "Link_06"],
        "Link_05": ["Link_04", "Link_06", "base_link"],
        "Link_06": ["base_link", "Link_05"],
        "base_link": ["Link_06", "Left_1_Link", "Left_2_Link", "Left_Support_Link", "Left_Pad_Link", "Right_1_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],
        "Left_1_Link": ["base_link", "Left_2_Link", "Left_Support_Link", "Left_Pad_Link", "Right_1_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],
        "Right_1_Link": ["base_link", "Left_1_Link", "Left_2_Link", "Left_Support_Link", "Left_Pad_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],
        "Left_1_Link": ["base_link", "Left_2_Link", "Left_Support_Link", "Left_Pad_Link", "Right_1_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],
        "Left_2_Link": ["base_link", "Left_1_Link", "Left_Support_Link", "Left_Pad_Link", "Right_1_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],
        # "link_camera": ["base_link", "Left_1_Link", "Left_Support_Link", "Left_Pad_Link", "Right_1_Link", "Right_2_Link", "Right_Support_Link", "Right_Pad_Link"],

      }
    self_collision_buffer: 
      {
        "Link_00": 0.0,
        "Link_01": 0.0,
        "Link_02": 0.0,
        "Link_03": 0.0,
        "Link_04": 0.0,
        "Link_05": 0.0,
        "Link_06": 0.0,
        "base_link": 0.0,
        # "Left_1_Link": 0.0,
        # "Left_2_Link": 0.0,
        # "Left_Pad_Link": 0.0,
        # "Left_Support_Link": 0.0,
        # "Right_1_Link": 0.0,
        # "Right_2_Link": 0.0,
        # "Right_Pad_Link": 0.0,
        # "Right_Support_Link": 0.0,
        # "link_camera": 0.0,
        "attached_object": 0.0,
      }

    use_global_cumul: True
    # mesh_link_names: null
    mesh_link_names:
      [
        "Link_00",
        "Link_01",
        "Link_02",
        "Link_03",
        "Link_04",
        "Link_05",
        "Link_06",
        "base_link",
      ]

    cspace:
      joint_names: ["Joint01","Joint02","Joint03","Joint04","Joint05","Joint06", "Left_1_Joint", "Right_1_Joint" , "Left_Support_Joint", "Right_Support_Joint"]
      retract_config: [2.91469985,  1.65806279, -1.02974426,  1.12922803,  1.57428699, 0.85172067, -1, -1, 1, 1]

      # joint_names: ["Joint01","Joint02","Joint03","Joint04","Joint05","Joint06"]
      # retract_config: [2.91469985,  1.65806279, -1.02974426,  1.12922803,  1.57428699, 0.85172067]

      null_space_weight: [1,1,1,1,1,1, 1,1,1,1]
      cspace_distance_weight: [1,1,1,1,1,1, 1,1,1,1]
      max_jerk: 500.0
      max_acceleration: 15.0
balakumar-s commented 10 months ago

Attach spheres to robot is used for updating spheres of a specific link. This is used to attach spheres dynamically for a grasped object. Example: https://curobo.org/source/advanced_examples/2b_block_stacking_example.html

If you want to permanently attach spheres, follow this tutorial to create spheres for your robot: https://curobo.org/source/tutorials/1_robot_configuration.html

pmj110119 commented 10 months ago

Yes, I indeedly used https://curobo.org/source/advanced_examples/2b_block_stacking_example.html.

My question is that when I attach spheres, MotionGeneration will report an error.

balakumar-s commented 10 months ago

In your robot configuration file,make the value 100 spheres instead of 4:

extra_collision_spheres: {"attached_object": 4}

balakumar-s commented 10 months ago

Actually I'm not sure how you are attaching spheres, can you share minimal python code?

pmj110119 commented 10 months ago

I use the code below to create a cuboid manually and attach spheres to the MotionGeneration:

      # ee_pose
      pos = np.array([-0.61692637, -0.06009333,  0.52288121])
      quater = np.array([-0.73811456, -0.21932792,  0.60144261,  0.21295291])
      ee_pose = Pose(
            position=self.tensor_args.to_device(pos),
            quaternion=self.tensor_args.to_device(quater),
      )

      # get spheres
      obs = Cuboid(
              name='target_obj',
              pose = [-0.94450682, -0.03707641, 0.45553475+0.02, 1, 0, 0, 0],
              dims = [0.045, 0.045, 0.15],
          )
      sph = obs.get_bounding_spheres(
          n_spheres=4,
          surface_sphere_radius=0.001,
          pre_transform_pose=ee_pose,
          tensor_args=self.tensor_args,
          fit_type=SphereFitType.VOXEL_VOLUME_SAMPLE_SURFACE,
          voxelize_method='ray',
      )
      sph_list = [s.position + [s.radius] for s in sph]

      spheres = self.tensor_args.to_device(torch.as_tensor(sph_list))

      # attach spheres
      self.motion_gen.attach_spheres_to_robot(sphere_tensor=spheres, link_name="attached_object")
balakumar-s commented 10 months ago

What's the shape of spheres tensor?

pmj110119 commented 10 months ago

What's the shape of spheres tensor?

torch.Size([4, 4]), and it corresponds to the settings in robot config.

extra_collision_spheres: {"attached_object": 4}
pmj110119 commented 10 months ago

If the collision world is not updated with the following code, plan_single will not report an error.

self.world_model.add_camera_frame(data_camera, "world")
self.world_model.process_camera_frames("world", False)
torch.cuda.synchronize()
self.world_model.update_blox_hashes()

But if I have updated collision world and attach spheres, I will get an error almost every time.

balakumar-s commented 10 months ago

We have a check here: https://github.com/NVlabs/curobo/blob/c09d94908d8ad52f557ed59195c1ceb2d0434d65/src/curobo/wrap/reacher/motion_gen.py#L2381

To make sure size is exact.

If the size is not the issue, can you disable cuda graph and run to see the exact error?

pmj110119 commented 10 months ago

Yes, the size is exact.

And when I disable the cuda_graph, it didn't report error. But the speed is too slow.

pmj110119 commented 10 months ago

Hi balakumar, after a lot of testing, I think I found the problem.

When I set spheres_num to 100, even if I disable cuda_graph, a CUDA Error will always be reported.

extra_collision_spheres: {"attached_object": 100}

When I set spheres_num to a smaller number such as 4 or 8, the probability of CUDA Error becomes very low (but still exists)

In other words, some randomly generated sphere balls will cause CUDA Error. Do you have any opinion? @balakumar-s

balakumar-s commented 10 months ago

Can you share your full script? If the error only happens with CudaGraph then the issue could be somewhere else. E.g., changing planning mode from single to batch.

pmj110119 commented 10 months ago

Thanks for reply! My main code is shown below and the function entry is Planner.plan().

If I don't use attach_spheres_to_robot (set attach=False), the code can always run without error.

class Planner:   
    def __init__(self, 
                 robot, 
                 collision_cfg='./my_collision_wall.yml'
                 ):
        self.tensor_args = TensorDeviceType()
        self._load_robot(robot)
        self._load_world(collision_cfg)
        self._init_motion_gen()

    def _load_robot(self, robot):
        self.robot_cfg = load_yaml(join_path(get_robot_configs_path(), robot))["robot_cfg"]

    def _load_world(self, obstacles_cfg='./my_collision_wall.yml'):
        self.world_cfg = WorldConfig.from_dict(
            {
                "blox": {
                    "world": {
                        "pose": [0, 0, 0, 1, 0, 0, 0],
                        "integrator_type": "tsdf",
                        "voxel_size": 0.01,
                    }
                }
            }
        )
        # add obstacle into collision_world
        self.world_cfg_obs = WorldConfig.from_dict(load_yaml(obstacles_cfg))
        self.world_cfg.add_obstacle(self.world_cfg_obs.cuboid[0])

    def _init_motion_gen(self):
        self.plan_config = MotionGenPlanConfig(
            enable_graph=False, enable_graph_attempt=7, max_attempts=5, enable_finetune_trajopt=True
        )

        motion_gen_config = MotionGenConfig.load_from_robot_config(
            self.robot_cfg,
            self.world_cfg,
            self.tensor_args,
            trajopt_tsteps=32,
            collision_checker_type=CollisionCheckerType.BLOX,
            use_cuda_graph=False,
            num_trajopt_seeds=24,
            num_graph_seeds=24,
            interpolation_dt=0.02,  # 0.03
            collision_activation_distance=0.025,    # 0.025
            acceleration_scale=1.0,
            self_collision_check=False,      # 容易自锁========
            maximum_trajectory_dt=1, # 0.25
            finetune_dt_scale=1.25, # 1.05
            fixed_iters_trajopt=True,
            finetune_trajopt_iters=300,
            minimize_jerk=True,
        )
        self.motion_gen = MotionGen(motion_gen_config)
        # import ipdb;ipdb.set_trace()
        print("warming up..")
        self.motion_gen.warmup(warmup_js_trajopt=False)

        self.world_model = self.motion_gen.world_collision

    def update_rgbd(self, depth, intrinsics, camera_pose, visual=False):
        self.world_model.decay_layer("world")

        camera_pose2 = format_pose.from_transforma_mat(camera_pose)
        # self.world_model.clear_blox_layer('world')
        # Update RGB-D data into NVBlox and compute colision
        # import pdb;pdb.set_trace()
        data_camera = CameraObservation(
                depth_image=depth, intrinsics=intrinsics, pose=camera_pose2
            )
        data_camera = data_camera.to(device=self.tensor_args.device)

        self.world_model.add_camera_frame(data_camera, "world")
        self.world_model.process_camera_frames("world", False)
        torch.cuda.synchronize()
        self.world_model.update_blox_hashes()

        voxels = None

        if visual:
            voxel_size = 0.01
            bounding = Cuboid("t", dims=[1, 1, 1.0], pose=[0, 0, 0, 1, 0, 0, 0])
            voxels = self.world_model.get_voxels_in_bounding_box(bounding, voxel_size)

        # coll_mesh = self.motion_gen.world_coll_checker.get_mesh_in_bounding_box(
        #                 Cuboid(name="test", pose=[0, 0, 0, 1, 0, 0, 0], dims=[1, 1, 1]),
        #                 voxel_size=0.005,
        #             )
        # coll_mesh.save_as_mesh("world_objects.stl")
        return voxels

    def attach_target(self, 
                      ee_pose=None,
                      sphere_fit_type = SphereFitType.VOXEL_VOLUME_SAMPLE_SURFACE,
                      surface_sphere_radius = 0.001,
                      voxelize_method = 'ray',
                      remove_obstacles_from_world_config = False,
                      n_spheres=6,
                      link_name = "attached_object",
                      ):

        pos, quater = ee_pose
        ee_pose = Pose(
            position=self.tensor_args.to_device(pos),
            quaternion=self.tensor_args.to_device(quater),
        )

        world_objects_pose_offset = Pose.from_list([0, 0, 0.005, 1, 0, 0, 0], self.tensor_args)
        if world_objects_pose_offset is not None:
            ee_pose = world_objects_pose_offset.inverse().multiply(ee_pose)

        obs = Cuboid(
                name='target_obj',
                pose = [pos[0], pos[1], pos[2], 1, 0, 0, 0],
                dims = [0.1, 0.1, 0.2],
            )
        sph = obs.get_bounding_spheres(
            n_spheres,
            surface_sphere_radius,
            pre_transform_pose=ee_pose,
            tensor_args=self.tensor_args,
            fit_type=sphere_fit_type,
            voxelize_method=voxelize_method,
        )
        sph_list = [s.position + [s.radius] for s in sph]

        spheres = self.tensor_args.to_device(torch.as_tensor(sph_list))

        max_spheres = self.motion_gen.robot_cfg.kinematics.kinematics_config.get_number_of_spheres(link_name)
        if spheres.shape[0] > max_spheres:
            spheres = spheres[: spheres.shape[0]]

        sphere_tensor = torch.zeros((max_spheres, 4))
        sphere_tensor[:, 3] = -10.0
        sphere_tensor[: spheres.shape[0], :] = spheres.contiguous()

        self.motion_gen.attach_spheres_to_robot(sphere_tensor=sphere_tensor, link_name=link_name)

    def plan(self, 
            joint_pos, 
            target_pose,
            ee_pose=None,
            attach_target=False,
            ):
        # Motion Plan with current state and target pose
        joint_state = JointState(
                position=self.tensor_args.to_device(joint_pos),
                velocity=self.tensor_args.to_device(np.zeros_like(joint_pos)),
                acceleration=self.tensor_args.to_device(np.zeros_like(joint_pos)),
                jerk=self.tensor_args.to_device(np.zeros_like(joint_pos)),
                joint_names=self.motion_gen.kinematics.joint_names,
            )

        joint_state = joint_state.get_ordered_joint_state(
            self.motion_gen.kinematics.joint_names
            )
        pos, quater = target_pose
        ik_goal = Pose(
            position=self.tensor_args.to_device(pos),
            quaternion=self.tensor_args.to_device(quater),
        )

        if attach_target:
            self.attach_target(ee_pose)

        # Planning
        # import pdb;pdb.set_trace()
        t0 = time.time()
        result = self.motion_gen.plan_single(joint_state.unsqueeze(0), ik_goal, self.plan_config)
        t1 = time.time()
        print('Plan once with %dms'%(int((t1-t0)*1000)))
        succ = result.success.item() 

        # Plan Results
        cmd_plan = None
        if succ:
            # import pdb;pdb.set_trace()
            cmd_plan = result.get_interpolated_plan()
            cmd_plan = self.motion_gen.get_full_js(cmd_plan).position.cpu()
        return cmd_plan
balakumar-s commented 10 months ago

We pushed a fix to nvblox_torch that should also fix this error. Can you pull the latest main for nvblox_torch and try it?