cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License
11.77k stars, 2.88k forks

Annotate rigid objects in 2D image with standard 3D cube #3387

Open hnuzhy opened 3 years ago

hnuzhy commented 3 years ago

My actions before raising this issue

I have read and searched the official docs and past issues for a solution. No one appears to have reported the same problem.

Expected Behaviour

I want to annotate the head orientation of people in a 2D image with a standard 3D cube. Here, the head is treated as a rigid object. A standard cube is defined as follows: the three edges meeting at any vertex are mutually perpendicular, and all twelve edges are equal in length (for example, unit length).

img1

After labeling, we would obtain the eight projected vertices of the cube in 2D image coordinates. If three Euler angles (pitch, yaw, roll) are used to represent the orientation of the head, these projected points can be converted into the corresponding angles.
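The conversion from projected vertices to angles can be sketched as follows. This is only a minimal sketch, not CVAT code: it assumes a scaled orthographic (weak-perspective) camera, a known vertex order, image axes used directly as camera x and y, and the rotation convention R = Rz(roll) · Ry(yaw) · Rx(pitch). The names `CUBE` and `euler_from_projected_cube` are hypothetical, and Gram-Schmidt is used here as a simple stand-in for a proper least-squares rotation fit.

```python
import math

# Cube vertices centred at the origin, in a fixed (assumed) annotation order.
CUBE = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]


def _normalise(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def euler_from_projected_cube(points_2d):
    """Estimate (pitch, yaw, roll) in degrees from eight projected vertices.

    Assumes scaled orthographic projection and R = Rz(roll) @ Ry(yaw) @ Rx(pitch).
    """
    cx = sum(p[0] for p in points_2d) / 8.0
    cy = sum(p[1] for p in points_2d) / 8.0
    # Least-squares fit of the two scaled top rows of R. The cube's coordinate
    # columns are mutually orthogonal, so the fit reduces to plain dot products.
    row_u = [sum(c[k] * (p[0] - cx) for c, p in zip(CUBE, points_2d)) for k in range(3)]
    row_v = [sum(c[k] * (p[1] - cy) for c, p in zip(CUBE, points_2d)) for k in range(3)]
    # Gram-Schmidt removes the scale and small annotation inconsistencies.
    r1 = _normalise(row_u)
    dot = sum(a * b for a, b in zip(r1, row_v))
    r2 = _normalise([b - dot * a for a, b in zip(r1, row_v)])
    # The third row of a rotation matrix is the cross product of the first two.
    r3 = [r1[1] * r2[2] - r1[2] * r2[1],
          r1[2] * r2[0] - r1[0] * r2[2],
          r1[0] * r2[1] - r1[1] * r2[0]]
    pitch = math.atan2(r3[1], r3[2])
    yaw = -math.asin(max(-1.0, min(1.0, r3[0])))
    roll = math.atan2(r2[0], r1[0])
    return tuple(math.degrees(a) for a in (pitch, yaw, roll))
```

A full perspective camera would need a different solver (for instance a PnP-style fit), so the angles from this sketch are only approximate for heads close to the camera.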

img2

Current Behaviour

img3

Possible Solution

I have three suggestions for adding a unit cube label in a future version of CVAT.

1) Improve cuboid. The current cuboid is actually oblique. However, objects in the real world should be marked with regular cuboids, in which the three edges meeting at each vertex are perpendicular. At the same time, we need to free the cuboid's third dimension so that it can rotate freely. I don't know whether this is easy to implement in TypeScript. Three.js and other open source packages could serve as references.

2) Modify cuboid-3d. As far as I know, recent versions of CVAT already support 3D point cloud annotation. Would it be possible to port the 3D cuboid module to 2D image annotation? I'm not very familiar with point cloud annotation, so I can't offer a detailed opinion here.

3) Add cube. If possible, consider adding a new cube label to the shape buttons on the left side of CVAT. Users could then choose to add a new 3D cube. The cube instance would support rotation by any angle about all three axes. The software would automatically record the final Euler angles once the cube's pose is fixed.
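What option 3 would need to draw and record can be sketched like this. It is a rough illustration rather than a proposal for CVAT's actual data model: the function names, the `record` layout, the scaled orthographic projection, and the convention R = Rz(roll) · Ry(yaw) · Rx(pitch) are all assumptions.

```python
import math


def rotation_matrix(pitch, yaw, roll):
    """R = Rz(roll) @ Ry(yaw) @ Rx(pitch), angles in degrees (assumed convention)."""
    a, b, g = (math.radians(v) for v in (pitch, yaw, roll))
    ca, sa, cb, sb, cg, sg = (math.cos(a), math.sin(a), math.cos(b),
                              math.sin(b), math.cos(g), math.sin(g))
    return [[cb * cg, sa * sb * cg - ca * sg, ca * sb * cg + sa * sg],
            [cb * sg, sa * sb * sg + ca * cg, ca * sb * sg - sa * cg],
            [-sb, sa * cb, ca * cb]]


def project_cube(pitch, yaw, roll, center=(0.0, 0.0), size=1.0):
    """Return the eight projected vertices of a rotated cube as (x, y) tuples."""
    R = rotation_matrix(pitch, yaw, roll)
    corners = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
    projected = []
    for c in corners:
        # Rotate, then drop the depth coordinate (scaled orthographic projection).
        u = sum(R[0][k] * c[k] for k in range(3))
        v = sum(R[1][k] * c[k] for k in range(3))
        projected.append((center[0] + size * u, center[1] + size * v))
    return projected


# A 100-pixel cube placed at (320, 240), turned 30 degrees about the vertical axis.
vertices = project_cube(pitch=0.0, yaw=30.0, roll=0.0, center=(320.0, 240.0), size=100.0)
# Hypothetical annotation record: the angles are stored directly, so no
# reconstruction from 2D points is needed afterwards.
record = {"pitch": 0.0, "yaw": 30.0, "roll": 0.0, "points": vertices}
```

Because the interactive cube keeps its angles as state, the Euler angles are exact by construction; the 2D points are derived data used only for rendering.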

Here are two examples of 3D model interaction. The first is the rotation of a 3D head model in Mayavi, where the interaction relies on both mouse and keyboard. The second uses the 3D image editing tool in Windows 10 to place and manipulate 3D models on 2D images, which requires only the mouse.

demo1_pymayavi-3D_head_model Example 1

demo2_windows10_3D_edit Example 2

Next steps

Looking forward to your reply. I am willing to do whatever I can to help advance this feature.

chiehpower commented 2 years ago

This is such a cool feature...

nmanovic commented 2 years ago

@hnuzhy, I agree that we need to improve the functionality. Your explanation is really helpful. Could you please describe your research area and organization? Unfortunately, my team has a huge number of requests, and we already have an approximate roadmap for Q3'21 and Q4'21. So I'm trying to clarify details that will help me raise the priority of this feature.

hnuzhy commented 2 years ago

> This is such a cool feature...

Yes, it is a pretty cool feature, but it is not easy to implement :-(

hnuzhy commented 2 years ago

> @hnuzhy, I agree that we need to improve the functionality. Your explanation is really helpful. Could you please describe your research area and organization? Unfortunately, my team has a huge number of requests, and we already have an approximate roadmap for Q3'21 and Q4'21. So I'm trying to clarify details that will help me raise the priority of this feature.

@nmanovic Hi, I'm glad you agree with my proposal. I am a PhD student in the computer department at SJTU. My research field is the intersection of AI and education, and my specific research direction is object detection and pose estimation in computer vision. I would like to explain the motivation for this request from two aspects.

Aspect one: Academic Value

Recently, I've been studying methods for detecting students' attention in the classroom. Head orientation (head pose estimation) is one of the key factors. However, as far as I know, multi-person head pose estimation in 2D images is not well developed. At present, there are some SOTA algorithms for estimating the pose of a single, well-cropped head, including FSA-Net (CVPR 2019) and WHENet (BMVC 2020). But their results are not ideal, and they are not easy to extend to multiple people in a single image. Most importantly, the datasets used by these algorithms are either obtained by 3D head projection (300W-LP and AFLW2000-3D) or consist of 3D Euler angles collected with depth cameras in an experimental setting (CMU Panoptic Studio Dataset).

demo_FSANet Prediction example 1 of FSA-Net (The input can only be a single person's head with a visible face.)

demo_FSANet_multiple Prediction example 2 of FSA-Net (First, the head bbox of each person is detected by MTCNN, and then the pose of each head is estimated separately. Therefore, this is neither an efficient nor an inherently multi-person head pose estimation algorithm.)

demo_WHENet_360 Prediction example of WHENet (The input can only be a single person's head, but over a wide range of poses. The predictable yaw angle of the head covers the full 360 degrees.)

Datasets have always been the cornerstone of deep learning algorithms, and head pose estimation is no exception. Therefore, I want to annotate the 3D head orientation, that is, the three Euler angles of the head, directly in the 2D image. As described at the start of this issue, the most accurate annotation scheme relies on interacting freely with a 3D cube on 2D images. In my opinion, once such a dataset is constructed, it will substantially advance research on the corresponding algorithms. For example, a bottom-up method could be designed to predict the poses of all heads in an image in a single pass. In addition, compared with a single cropped head image, the complete scene and human body information in the original image can support more accurate head pose estimation.

Aspect two: Enhancement Feasibility

After some investigation, I did not find any tool with true 3D cube annotation. Fortunately, CVAT has options that come close. The first is Draw new cuboid; however, the newly built cuboid lacks rotational freedom. The second is Draw new polyline: by annotating three consecutive non-coplanar edges of a 3D cube that approximates the head orientation, we can deduce approximate Euler angles. Unfortunately, this annotation process is highly subjective. We cannot directly see the actual pose of the resulting cube unless we annotate interactively with a real 3D cube. If we settle for this method, the credibility of the final annotations will be questionable.
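To make the polyline workaround concrete, here is a minimal sketch of how Euler angles might be deduced from a four-point polyline. It is an illustration under stated assumptions, not the procedure actually used for the examples in this thread: the three segments are assumed to follow the cube's +x, +y and +z edges in that order, the projection is assumed to be scaled orthographic, the convention is R = Rz(roll) · Ry(yaw) · Rx(pitch), and `euler_from_polyline` is a hypothetical name.

```python
import math


def euler_from_polyline(p0, p1, p2, p3):
    """Estimate (pitch, yaw, roll) in degrees from a 4-point polyline.

    Assumes the three segments follow the cube's +x, +y and +z edges in that
    order, scaled orthographic projection, and R = Rz(roll) @ Ry(yaw) @ Rx(pitch).
    """
    pts = (p0, p1, p2, p3)
    # Each projected edge is one column of the scaled top two rows of R.
    row_u = [pts[i + 1][0] - pts[i][0] for i in range(3)]
    row_v = [pts[i + 1][1] - pts[i][1] for i in range(3)]
    norm_u = math.sqrt(sum(c * c for c in row_u))
    r1 = [c / norm_u for c in row_u]
    # Hand-drawn edges are rarely consistent, so force r2 to be orthogonal to r1.
    dot = sum(a * b for a, b in zip(r1, row_v))
    r2 = [b - dot * a for a, b in zip(r1, row_v)]
    norm_v = math.sqrt(sum(c * c for c in r2))
    r2 = [c / norm_v for c in r2]
    # Third row of R as the cross product of the first two.
    r3_x = r1[1] * r2[2] - r1[2] * r2[1]
    r3_y = r1[2] * r2[0] - r1[0] * r2[2]
    r3_z = r1[0] * r2[1] - r1[1] * r2[0]
    pitch = math.atan2(r3_y, r3_z)
    yaw = -math.asin(max(-1.0, min(1.0, r3_x)))
    roll = math.atan2(r2[0], r1[0])
    return tuple(math.degrees(a) for a in (pitch, yaw, roll))


# Moving a single endpoint by a few pixels shifts the estimated angles.
clean = euler_from_polyline((100, 100), (180, 110), (175, 185), (150, 170))
nudged = euler_from_polyline((100, 100), (180, 110), (175, 185), (153, 174))
```

Comparing `clean` and `nudged` gives a feel for the subjectivity problem: the annotator never sees the cube these angles imply, so small endpoint errors go unnoticed.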

Here are three examples of rough annotation results made with Draw new polyline. All images are from the public CrowdHuman dataset. The annotated object is the head in any orientation, including cases where the face is visible, occluded, or not visible at all. In many cases, the current polyline annotation method is difficult and inaccurate.

issue_anno_img1

issue_anno_img2

issue_anno_img3

In short, it would be very useful to add interactive annotation of rigid 3D shapes (which can only be rotated, translated and scaled) to 2D images. In addition to supporting head orientation labeling, the new function could be extended to the annotation of other rigid objects. Once similar datasets for general objects are constructed, we can try to develop a simple and direct 3D object pose estimation algorithm based only on 2D images. We expect that such a method could be comparable to estimation algorithms based on RGB-D or 3D point clouds.

Finally, I am not in a position to propose an overall plan for this enhancement in terms of CVAT's UI design or code changes, but I am willing to do what I can. I sincerely thank CVAT's main contributors for their work, and I hope you will seriously consider adding this task to the roadmap.

Kucev commented 2 years ago

I support the request. We also have a need for such functionality.

schliffen commented 2 years ago

This is a growing request from the automotive industry as well; we need cuboid annotations to be done on RGB images, not point clouds.

hnuzhy commented 2 years ago

> This is a growing request from the automotive industry as well; we need cuboid annotations to be done on RGB images, not point clouds.

Yes, you are right. Actually, I wrote a simple 2D head pose annotation tool using PyQt + Mayavi last year. As shown below, the annotator labels a head by adding a bounding box in the 2D image, then adjusts the 3D head model in the right-hand area with the mouse or keyboard until its orientation matches that of the boxed head. The accompanying 3D cube is projected onto the 2D image, and the corresponding Euler angles are recorded for each head pose.

HeadPoseAnnotationUI0

However, this tool runs only on the desktop and does not fully match what I originally envisioned (see the issue description above for details). I have since been busy with other things and have not updated or polished the tool for a long time. If possible, I still look forward to seeing this annotation function in CVAT.

ATT1KA commented 1 month ago

Seconding this; it would be immensely valuable for pose estimation of objects in robotics.