cviviers / YOLOv5-6D-Pose

6-DoF Pose estimation based on the YOLOv5 framework. Specific focus on instruments in X-ray applications
https://ieeexplore.ieee.org/document/10478293
GNU General Public License v3.0

How can I make a dataset? #9

Open lsy-92 opened 1 month ago

lsy-92 commented 1 month ago

Thanks for your amazing work.

I made a custom dataset, but my training is going wrong: my loss is 0. So I want to ask about the dataset format.

Do I normalize the Object Size Range by the image width and height?

And why are the rotation vector and translation vector missing from the labels in the LINEMOD dataset?

Thanks

Label Format. Each of the 21 numbers corresponds to specific data points:

0. Class Label
1. Centroid Coordinates: x0 (x-coordinate), y0 (y-coordinate)
2. Corner Coordinates: from x1, y1 (first corner) to x8, y8 (eighth corner)
3. Object Size Range: x range, y range
4. Focal length x
5. Focal length y
6. Sensor width (in optical this can just be the image size)
7. Sensor height (in optical this can just be the image size)
8. Focal offset x (u0)
9. Focal offset y (v0)
10. Image width
11. Image height
12. Object rotation vector (Rodrigues) [3x1]
13. Object translation vector [3x1]

Note: Coordinates are normalized by image width and height (x / image_width, y / image_height).
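For illustration, a minimal sketch of writing one label line in this layout (this is not the repo's `create_label`; the helper name and argument order are assumptions):

```python
import numpy as np

def write_label_line(path, cls_id, centroid, corners, size_range,
                     fx, fy, sensor_w, sensor_h, u0, v0, img_w, img_h,
                     rvec, tvec):
    """Append one pose label in the layout above.

    centroid: (x0, y0) in pixels, corners: eight (x, y) pixel pairs,
    size_range: (x_range, y_range) in pixels. Pixel quantities are
    normalized by the image width/height before writing.
    """
    vals = [cls_id, centroid[0] / img_w, centroid[1] / img_h]
    for x, y in corners:
        vals += [x / img_w, y / img_h]
    vals += [size_range[0] / img_w, size_range[1] / img_h,
             fx, fy, sensor_w, sensor_h, u0, v0, img_w, img_h]
    vals += list(np.ravel(rvec)) + list(np.ravel(tvec))
    with open(path, "a") as f:
        f.write(" ".join(str(v) for v in vals) + "\n")
```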

cviviers commented 1 month ago

Hi @lsy-92,

Yes, thanks for pointing this out. The object size range / object width and height should also be normalized by the image width & height. If you are using the create_label function in the provided code, it will do this correctly based on the object keypoints.

I will update the description to make this clear.

The rotation vector and translation vector are not used in the code at the moment. I added them so that I can check the predicted poses later on, or in case I want to switch to an alternative approach that does direct pose estimation, such as EfficientPose. Feel free to exclude them if you like.

lsy-92 commented 1 month ago

Thanks for your answer.

I made my dataset using Blender.

I modeled my object and extracted the necessary information with Blender.

But I think it is not correct.

If I want to make a custom dataset for your project, should I use your create_dataset.ipynb?

Thanks.

cviviers commented 1 month ago

Blender is definitely the correct starting point. Do you plan on acquiring the poses in Blender as well, or do you have a real object that you want to capture images of? I used Blender to create 3D models of all the objects (exported as .ply files). Acquire real poses using the description in create_dataset.ipynb and, in addition, project the 3D model onto the real images to get fine-grained masks to train with. This will help immensely with the robustness of the model.
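As a rough illustration of the mask idea (a sketch, not the repo's pipeline; it assumes a roughly convex object, a known pose `rvec`/`tvec`, and an intrinsic matrix `K`), you could project the model vertices and fill their hull:

```python
import cv2
import numpy as np

def approximate_mask_from_mesh(verts, rvec, tvec, K, img_hw):
    """Project all mesh vertices into the image and fill their convex hull.

    Only a rough mask, usable for convex objects; a proper renderer
    (Blender, PyTorch3D) gives exact per-pixel masks.
    """
    pts, _ = cv2.projectPoints(np.asarray(verts, dtype=np.float64),
                               np.asarray(rvec, dtype=np.float64),
                               np.asarray(tvec, dtype=np.float64),
                               K, None)
    pts = pts.reshape(-1, 2).astype(np.int32)
    mask = np.zeros(img_hw, dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    return mask
```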

lsy-92 commented 1 month ago

I get all my information, such as poses and 3D bbox coordinates, using PyTorch3D and Blender. I think that method is a little bit of a cheat, but I think it should be possible. However, it doesn't work.

lsy-92 commented 1 month ago

I'm sorry for asking too many questions.

I tried detecting the LINEMOD cat object images using the pretrained yolov5s.pt.

Although the object is not detected in some images, the YOLOv5-6D-Pose model trains well.

But my custom dataset can't train, because in PoseLoss

```python
n = b.shape[0]  # number of targets
if n:
```

n is 0.

What am I doing wrong?

Thanks.

cviviers commented 1 month ago

Hi @lsy-92,

It is no problem to ask questions :) If you show me some examples of your blender setup and data being generated I can see if there is anything wrong.

For clarification, the YOLOv5s and YOLOv5x weights are COCO-pretrained. We only use them to initialize the model weights for pose estimation training. Use the LINEMOD object weights for pose estimation.

The error you describe suggests that there is no label in the label file for that particular image. To make sure your images and labels are correct before and after augmentation, you can change line 741 to True. This will display the images.
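A quick way to check for missing or empty labels (just a sketch; it assumes .jpg images and per-image .txt label files with matching stems):

```python
from pathlib import Path

def find_images_without_labels(image_dir, label_dir):
    """Return image names whose label file is missing or empty,
    which would give n = 0 targets in the pose loss."""
    bad = []
    for img in sorted(Path(image_dir).glob("*.jpg")):
        lbl = Path(label_dir) / (img.stem + ".txt")
        if not lbl.exists() or not lbl.read_text().strip():
            bad.append(img.name)
    return bad
```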

lsy-92 commented 1 month ago

I use pytorch3d.

```python
import torch
import pytorch3d
from pytorch3d.structures import Meshes, join_meshes_as_scene
from pytorch3d.renderer import (
    look_at_view_transform, FoVPerspectiveCameras, RasterizationSettings,
    MeshRasterizer, MeshRenderer, SoftPhongShader, HardFlatShader, BlendParams,
)

# Apply a random rotation/translation to every mesh and join them into one scene
mesh_R_T = []
for i, mesh in enumerate(meshes):
    translation = trans_3d[rand_3d[i]].to(device)
    rotation = rotat_3d[i].to(device)
    verts = mesh.verts_packed() @ rotation.T
    verts = verts + translation.unsqueeze(0)
    faces = mesh.faces_packed()
    mesh = Meshes(verts=[verts], faces=[faces], textures=mesh.textures)
    mesh_R_T.append(mesh)

mesh_scene = join_meshes_as_scene(mesh_R_T, include_textures=True)
mesh_views = mesh_scene.extend(num_views)

# Sample random camera viewpoints
dist = torch.rand(num_views) * (d1 - d0) + d0
azim = torch.rand(num_views) * (a1 - a0) + a0
elev = torch.rand(num_views) * (e1 - e0) + e0
R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim, device=device)

camera = FoVPerspectiveCameras(device=device, R=R, T=T).to(device)
projection_matrices = camera.get_projection_transform().get_matrix()
lights = pytorch3d.renderer.PointLights(device=device)
blend_params = BlendParams(1e-4, 1e-4, (0, 0, 0))

# RGB render
rgb_settings = RasterizationSettings(image_size=img_size, blur_radius=0.0, faces_per_pixel=4, bin_size=0)
rgb_rasterizer = MeshRasterizer(cameras=camera, raster_settings=rgb_settings)
rgb_shader = SoftPhongShader(device=device, cameras=camera, lights=lights, blend_params=blend_params)
rgb_renderer = MeshRenderer(rasterizer=rgb_rasterizer, shader=rgb_shader)
rgb_images = rgb_renderer(mesh_views, cameras=camera, lights=lights)
rgb_targets = [rgb_images[i, ..., :3] for i in range(num_views)]

# Depth render
depth_raster_settings = RasterizationSettings(image_size=img_size, blur_radius=0.0, faces_per_pixel=1, bin_size=0)
depth_rasterizer = MeshRasterizer(cameras=camera, raster_settings=depth_raster_settings)
depth_fragments = depth_rasterizer(mesh_views, camera=camera)
depth_zbuf = depth_fragments.zbuf
depth_images = depth_zbuf.min(dim=-1).values

# Segmentation (mask) render
seg_raster_settings = RasterizationSettings(image_size=img_size, blur_radius=0.0, faces_per_pixel=1, bin_size=0)
seg_rasterizer = MeshRasterizer(cameras=camera, raster_settings=seg_raster_settings)
seg_shader = HardFlatShader(device=device, cameras=camera, lights=lights, blend_params=blend_params)
seg_renderer = MeshRenderer(rasterizer=seg_rasterizer, shader=seg_shader)
```

This is how I make the JPEGImages and masks (example renders attached).

Then I make the label:

```python
label = create_label(1, origin_x, origin_y, bbox_3d_list, x_range, y_range, fx, fy,
                     img_size, img_size, u0, v0, img_size, img_size, RRR, TTT)
```

which gives these values (broken out here per field):

```
1                     class
0.575 0.357812        center x, y
0.58872 0.364275 0.633229 0.365654 0.569569 0.487762 0.613545 0.482464 0.533243 0.352083 0.57823 0.354027 0.517102 0.469103 0.561416 0.465034
                      x1, y1 (first corner) to x8, y8 (eighth corner)
0.854262 1.501989     x range, y range
1.732051 1.732051     focal x, y
640 640               sensor width, height (img_size)
0.0 0.0               u0, v0
640 640               image width, height (img_size)
```

And I visualize this 3D bbox like this (image attached).

And I lay out the files like the LINEMOD dataset (directory screenshot attached).

cviviers commented 1 month ago

Having a quick look, the only issue I can see is that u0, v0 (the camera principal point) should be 320 and 320, not 0, assuming you have a perfect camera.

Can you make a projection of the 3D bounding box back onto the image and see if it aligns with your object? There is some code for that in the data_curation.
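For reference, a minimal sketch of that check with OpenCV (not the data_curation code itself; it assumes a Rodrigues rotation vector, a translation in the same units as the model, and pinhole intrinsics fx, fy, u0, v0):

```python
import cv2
import numpy as np

def project_box_corners(corners_3d, rvec, tvec, fx, fy, u0, v0):
    """Project the 8 corners of the 3D bounding box (object coordinates)
    into pixel coordinates so they can be drawn over the image."""
    K = np.array([[fx, 0.0, u0],
                  [0.0, fy, v0],
                  [0.0, 0.0, 1.0]], dtype=np.float64)
    pts, _ = cv2.projectPoints(np.asarray(corners_3d, dtype=np.float64),
                               np.asarray(rvec, dtype=np.float64),
                               np.asarray(tvec, dtype=np.float64),
                               K, None)
    return pts.reshape(-1, 2)  # (8, 2) pixel coordinates
```

If the projected corners land on the object, the label geometry is consistent; if not, the pose, intrinsics, or normalization is off.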

lsy-92 commented 1 month ago

This is my projection matrix (image attached), and this is my 3D bbox code:

```python
def calculate_points_3db_box(point, R, T, projected_matrix, camera, img_shape=(256, 256)):
    point = point.view(1, 1, 3)
    origin_xyz = point[:, :, :3]
    # Transform the point into camera space and make it homogeneous
    transformed_origin_xyz = torch.bmm(origin_xyz, R) + T.unsqueeze(1)
    transformed_origin = torch.cat([transformed_origin_xyz,
                                    torch.ones_like(transformed_origin_xyz[:, :, :1])], dim=-1)
    # Project with the camera projection matrix and dehomogenize
    projected_origin = torch.bmm(transformed_origin, projected_matrix.transpose(1, 2))
    projected_origin = projected_origin.squeeze(1)
    print("projected_origin 2 : ", projected_origin)
    projected_origin_ndc = projected_origin[:, :3] / projected_origin[:, 3:]
    print("projected_origin_ndc 2 : ", projected_origin_ndc)
    # Convert to pixel coordinates
    x_pixel = projected_origin_ndc[:, 0] + img_shape[0] / 2
    y_pixel = projected_origin_ndc[:, 1] + img_shape[1] / 2
    print(f"3D Box Points -> X: {x_pixel}, Y: {y_pixel}")  # print debug info
    return x_pixel, y_pixel
```

```python
min_x, min_y, min_z = sampled_points.min(dim=1).values.squeeze()
max_x, max_y, max_z = sampled_points.max(dim=1).values.squeeze()
bbox_3d = torch.tensor([[min_x, min_y, min_z],
                        [max_x, min_y, min_z],
                        [min_x, max_y, min_z],
                        [max_x, max_y, min_z],
                        [min_x, min_y, max_z],
                        [max_x, min_y, max_z],
                        [min_x, max_y, max_z],
                        [max_x, max_y, max_z]], dtype=torch.float32, device=device)

bbox_3d_list = []
origin_x = (bbox[0] + bbox[2]) / 2
origin_y = (bbox[1] + bbox[3]) / 2
for i in range(8):
    x, y = calculate_points_3db_box(bbox_3d[i], RRR, TTT, projection_matrices[0].unsqueeze(0),
                                    camera=camera, img_shape=(640, 640))
    bbox_3d_list.append([x, y])

seg_pil = seg_pil.convert("RGBA")
draw = ImageDraw.Draw(seg_pil)
edges = [
    (0, 1), (1, 3), (3, 2), (2, 0),
    (4, 5), (5, 7), (7, 6), (6, 4),
    (0, 4), (1, 5), (2, 6), (3, 7)
]

for edge in edges:
    draw.line(
        (bbox_3d_list[edge[0]][0], bbox_3d_list[edge[0]][1],
         bbox_3d_list[edge[1]][0], bbox_3d_list[edge[1]][1]),
        fill="red",
        width=2
    )
```

And I tried changing False -> True, but the loss is still a zero tensor [0.].

Thanks

cviviers commented 1 month ago

Hi @lsy-92,

I am currently away for a few days, so I can't execute any code. Have you tried visualizing the 3D bounding box projection on your images? Does it align with the object you are interested in? If you visualize it with the dataloader in YOLOv5-6D and it looks good, then you can assume the data you have created is in the correct format for training.

lsy-92 commented 1 month ago

Thank you for replying while you are away.

I checked the dataloader and saved test.png (image attached).

So I think it looks good.

But the loss is always 0.

This is my output log:

```
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp70
Starting training for 5000 epochs...

     Epoch   gpu_mem     l_obj     l_box     l_cls  n_targets  img_size
  0%|          | 0/28 [00:00<?, ?it/s]
autoanchor: thr=0.25: 1.0000 best possible recall, 9.00 anchors past thr
autoanchor: n=9, img_size=640, metric_all=0.994/0.999-mean/best, past_thr=0.994-mean: 546,963, 546,969, 547,973, 546,978, 547,977, 547,978, 546,980, 546,983, 547,981
autoanchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.
pred : 3
targets : torch.Size([32, 22])
lbox 1 : tensor([0.], device='cuda:0')
Layer 0, anchor_t: 4.0, matched targets: 3
Layer 1, anchor_t: 4.0, matched targets: 3
Layer 2, anchor_t: 4.0, matched targets: 3
p : 3
n : 9
l2_dist: tensor(10.20634, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.00995], device='cuda:0', grad_fn=)
n : 9
l2_dist: tensor(6.61466, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.03554], device='cuda:0', grad_fn=)
n : 9
l2_dist: tensor(5.59261, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.11948], device='cuda:0', grad_fn=)
self.hyp['box'] = 1.5
loss = lbox + lcls
loss : tensor([5.73502], device='cuda:0', grad_fn=)

     0/4999     2.39G     2.175    0.1792         0        32       640:   4%|██▉       | 1/28 [00:07<03:22,  7.50s/it]

pred : 3
targets : torch.Size([32, 22])
lbox 1 : tensor([0.], device='cuda:0')
Layer 0, anchor_t: 4.0, matched targets: 1
Layer 1, anchor_t: 4.0, matched targets: 0
No targets matched for this layer.
Layer 2, anchor_t: 4.0, matched targets: 0
No targets matched for this layer.
p : 3
n : 3
l2_dist: tensor(10.24166, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.01166], device='cuda:0', grad_fn=)
n : 0
n : 0
self.hyp['box'] = 1.5
loss = lbox + lcls
loss : tensor([0.55961], device='cuda:0', grad_fn=)

     0/4999     7.21G     2.175    0.1123         0        32       640:  11%|████████▉ | 3/28 [00:12<01:28,  3.53s/it]

pred : 3
targets : torch.Size([32, 22])
lbox 1 : tensor([0.], device='cuda:0')
Layer 0, anchor_t: 4.0, matched targets: 9
Layer 1, anchor_t: 4.0, matched targets: 6
Layer 2, anchor_t: 4.0, matched targets: 6
p : 3
n : 27
l2_dist: tensor(8.90784, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.00896], device='cuda:0', grad_fn=)
n : 18
l2_dist: tensor(5.03969, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.02829], device='cuda:0', grad_fn=)
n : 18
l2_dist: tensor(4.34330, device='cuda:0', grad_fn=)
lbox 2 : tensor([0.09337], device='cuda:0', grad_fn=)
self.hyp['box'] = 1.5
loss = lbox + lcls
loss : tensor([4.48168], device='cuda:0', grad_fn=)
     0/4999     7.21G     2.175    0.1123         0        32       640:  11%|████████▉ | 3/28 [00:14<02:01,  4.84s/it]
Traceback (most recent call last):
  File "YOLOv5-6D-Pose/train.py", line 551, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "YOLOv5-6D-Pose/train.py", line 329, in train
    scaler.scale(loss).backward()
  File "anaconda3/envs/yolopose/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "anaconda3/envs/yolopose/lib/python3.9/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

cviviers commented 1 month ago

From the looks of it, your model is training. You only have 1 class right? So the classification loss is always 0.

lsy-92 commented 1 month ago

I'm sorry for the poor explanation.

The loss that I said was zero is lbox + lcls.

But I solved the problem by raising anchor_t to 10.0.

And I have a different question.

When will multiple-object pose estimation be supported?

Thanks.

cviviers commented 1 month ago

That's great to hear. Glad you have it working! If you would like to write up your data creation process, we could add it to the repo to help other people who want to do the same in the future.

The current code already works for training multiple objects; testing just needs some refinement. So feel free to try it out. I am doing some different research now, so I will have to work on it as soon as I get some free time.

lsy-92 commented 4 weeks ago

Thank you for your suggestion.

I'd love to, but it's a team member's code.

I think I can add it to the repo after the team member's paper is accepted.

And I have some problems.

Training goes well, but something is wrong with the center points (see the attached prediction image).

In this case, can I recover the exact value by calculating the center point and pose of the 3D bbox inversely?

Thanks.

cviviers commented 3 weeks ago

From the looks of it, the coordinate is on one of the sides of the object instead of the center. In all the objects in LINEMOD and in the X-ray objects I made, the center of the object is at (0, 0, 0) and the 3D bounding box is spaced around that. For example, a 1 cm³ cube would have its top-left corner at (-0.5 cm, -0.5 cm, -0.5 cm) and its bottom-right corner at (0.5 cm, 0.5 cm, 0.5 cm). This doesn't need to be the case, but it just makes things a bit easier.
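If your model is not centered, a minimal sketch of recentering its vertices (plain NumPy, done before computing poses; the same offset would then also have to be applied to your translations):

```python
import numpy as np

def center_vertices(verts):
    """Shift vertices so the 3D bounding box is centered on (0, 0, 0).

    verts: (N, 3) array of model vertices. Returns the shifted vertices
    and the offset that was subtracted.
    """
    offset = (verts.min(axis=0) + verts.max(axis=0)) / 2.0  # bounding-box center
    return verts - offset, offset
```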

lsy-92 commented 3 weeks ago

Thanks for your explanation.

If I fix camera.json and use this model in a real situation, does the model still perform correctly if the resolution or camera parameters of the training data differ from the actual input's?

Thanks.

cviviers commented 3 weeks ago

So the argument we make in the paper is that because the model only does keypoint prediction, we can change cameras. As long as the keypoint prediction is correct and the pose is computed with the new camera (and its corresponding camera parameters), the pose will be correct. To put this a bit more explicitly:

Our experiments were conducted on the X-ray "cameras", but the same should hold for normal cameras.
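As a sketch of that argument (not necessarily the repo's exact post-processing), the pose can be recovered from the predicted 2D keypoints with whatever intrinsics the current camera has, for example via PnP:

```python
import cv2
import numpy as np

def pose_from_keypoints(corners_3d, keypoints_2d, fx, fy, u0, v0):
    """Solve for the object pose from predicted corner keypoints using
    the intrinsics of the camera that captured the image."""
    K = np.array([[fx, 0.0, u0],
                  [0.0, fy, v0],
                  [0.0, 0.0, 1.0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(np.asarray(corners_3d, dtype=np.float64),
                                  np.asarray(keypoints_2d, dtype=np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    return rvec, tvec
```

Swapping cameras then only means swapping fx, fy, u0, v0 (and the resolution); the keypoint network itself is unchanged.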

Does that answer your question?

lsy-92 commented 3 weeks ago

Yes, that is exactly the answer I wanted.

I am trying to change my code from FoVPerspectiveCameras to PerspectiveCameras.

If I get some good results, I will upload my dataset creation code.

Thanks

lsy-92 commented 2 weeks ago

I have another question.

If there is one class, but there are multiple objects of the same class in an image, can I get multiple 3D bboxes?

Thank you always.

cviviers commented 2 weeks ago

Hey @lsy-92,

You can do that straightforwardly by increasing the max detections in the box_filter. Here you can set "max_det" to the number of objects and the confidence you allow. Do be aware that, compared to the non-maximum suppression (NMS) in earlier YOLO models, there is no real filtering here (you could get two predictions for the exact same object at the exact same location): the grid cells at the different scales could each have high confidence for the same object.

You could use the object bounding boxes in an NMS way to filter these, but I leave that to you :)
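If you do want that filtering, a minimal greedy-NMS sketch over 2D boxes derived from each prediction's projected keypoints (an illustration, not part of the repo):

```python
import numpy as np

def dedupe_predictions(boxes, scores, iou_thr=0.5):
    """Greedy NMS over 2D boxes (x1, y1, x2, y2); keeps the indices of the
    highest-scoring prediction per object and drops overlapping duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou < iou_thr]
    return keep
```

Here `boxes` could be, for example, the axis-aligned extent of each prediction's nine projected keypoints.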

lsy-92 commented 1 week ago

I want to test this model's .pt file on detection in YOLOv5. I want to know how much the feature extractor is affected and how much the pretraining epochs matter. I'm considering the domain gap, and I think this part needs to be solved.

And I increased max_det to 10, but I get this error.

```
Traceback (most recent call last):
  File "/data/lsy/YOLOv5-6D-Pose/train.py", line 551, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "/data/lsy/YOLOv5-6D-Pose/train.py", line 374, in train
    results = test.test(opt.data,
  File "/data/lsy/YOLOv5-6D-Pose/test.py", line 203, in test
    full_pr = predn[torch.where(predn[:, 19] == k), :]
TypeError: only integer tensors of a single element can be converted to an index
```

cviviers commented 16 hours ago

Hey @lsy-92,

Apologies for the late response. Yes, I think this will have to be slightly modified. The idea is to get the indices of the predictions that match the object you are currently trying to evaluate. So get all the predictions for that class (10+ now in your case), introduce a new for loop to iterate over each of them, and evaluate the performance of each. You can possibly use the best score on each object as the matching score and discard the rest.
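A rough sketch of that modification (the column index and variable names are taken from the error above; treat this as an assumption about how the surrounding code in test.py is structured, not a drop-in patch):

```python
import torch

def evaluate_class_predictions(predn, k, evaluate_one):
    """Evaluate every prediction of class k separately.

    predn: tensor of predictions with the class id stored in column 19
    (as in the error message above); evaluate_one is a callable that
    scores a single (1, num_cols) prediction against the ground truth.
    """
    cls_preds = predn[predn[:, 19] == k]                    # all predictions for class k
    scores = [evaluate_one(p.unsqueeze(0)) for p in cls_preds]
    # Keep the best-scoring prediction as the match, discard the rest.
    best = max(scores) if scores else None
    return best, scores
```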

I would like to learn more about the dataset/use case you have for this. It would be interesting for follow-up research or some collaboration :)