hnuzhy opened 3 years ago
This is such a cool feature ...
@hnuzhy , I agree that we need to improve the functionality. Your explanation is really helpful. Could you please describe your research area and organization? Unfortunately, my team has a huge number of requests, and we already have an approximate roadmap for Q3'21 and Q4'21. Thus I'm trying to clarify details that will help me increase the priority of this feature.
Yes, it is a pretty cool feature, and it is not easy to implement :-(
@nmanovic Hi, I'm glad you agree with my proposal. I am a PhD student in the computer science department at SJTU. My research field is the intersection of AI and education; my specific research directions are object detection and pose estimation in computer vision. I would like to explain the motivation behind this request from two aspects.
Aspect one: Academic Value
Recently, I've been studying methods for detecting students' attention in the classroom. Among the relevant cues, head orientation (head pose estimation) is one of the key factors. However, as far as I know, multi-person head pose estimation in 2D images is not well developed. At present, there are some SOTA algorithms for head pose estimation of a single, well-cropped head, including FSA-Net (CVPR 2019) and WHE-Net (BMVC 2020). However, their results are not ideal, and they are not easy to extend to the case of multiple people in a single image. Most importantly, the datasets used by these algorithms are obtained either by 3D head projection (300W-LP & AFLW2000-3D) or from 3D Euler angles collected by depth cameras in laboratory scenes (CMU Panoptic Studio dataset).
Prediction example 1 of FSA-Net (the input can only be a single person's head with a visible face).
Prediction example 2 of FSA-Net (first, each person's head bbox is detected by MTCNN, and then each cropped head is estimated separately; therefore, this is not an efficient or truly multi-person head pose estimation algorithm).
Prediction example of WHE-Net (the input can only be a single person's head, but in a wide range of poses; the predictable yaw angle of the head is omnidirectional).
Datasets have always been the cornerstone of deep learning algorithms, and head pose estimation is no exception. Therefore, I want to try to annotate the 3D head orientation, i.e. the three Euler angles of the head, directly in 2D images. As mentioned at the start of this issue, the most accurate annotation scheme hinges on interacting freely with a 3D cube on 2D images. In my opinion, once such a dataset is constructed, it will help drive major progress in the corresponding algorithm research. For example, a bottom-up method could be designed to directly predict the poses of all heads in an image in one pass. At the same time, compared with a single cropped head image, the complete scene and body information in the original image can support more accurate head pose estimation.
Aspect two: Enhancement Feasibility
After investigation, I did not find any existing tool with real 3D cube annotation. Fortunately, close functional options exist in CVAT. The first is Draw new cuboid. However, the newly built cuboid lacks rotational freedom. The second is Draw new polyline. By annotating three consecutive, non-coplanar edges of a 3D cube aligned with the head orientation, we can deduce approximate Euler angles. Unfortunately, this annotation process carries a large subjective error: we cannot see the actual pose of the implied cube directly, unless we annotate interactively with a real 3D cube. If we use this method reluctantly, the credibility of the final annotations will be questioned.
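The deduction described above can be sketched as follows, assuming a scaled-orthographic camera model (the function name and point ordering are hypothetical, for illustration only). Stacking the three projected edge vectors gives D = s * R[:2, :], where s is the unknown edge length in pixels, so the scale and the first two rows of the rotation can be read off directly:

```python
import numpy as np

def rotation_from_polyline(points):
    """Recover a head rotation matrix from four clicked 2D points.

    `points` (shape (4, 2)) traces three consecutive, mutually
    perpendicular edges of a cube of unknown pixel length s, assumed to
    be imaged by a scaled-orthographic projection.
    """
    pts = np.asarray(points, dtype=float)
    D = (pts[1:] - pts[:-1]).T        # (2, 3): projected edge vectors as columns
    s = np.linalg.norm(D[0])          # rows of a rotation matrix have unit norm
    r1, r2 = D[0] / s, D[1] / s       # first two rows of R
    r3 = np.cross(r1, r2)             # third row from right-handed orthonormality
    return np.vstack([r1, r2, r3])
```

With noisy clicks, r1 and r2 will not be exactly orthonormal; that residual is precisely the subjectivity error mentioned above, and a re-orthogonalization step (e.g. via SVD) would be needed in practice.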
Here are three examples of rough annotation results with Draw new polyline. The images are all from the public CrowdHuman dataset. The objects we annotate are heads with any orientation, including visible, occluded, and invisible faces. In many cases, the current polyline annotation method is difficult to apply and inaccurate.
In a word, it would be very useful to add interactive annotation of rigid 3D shapes (which can only be rotated, translated, and scaled) on 2D images. Besides supporting head orientation labeling, the new feature could also be extended to annotating other rigid objects. After building similar datasets for general objects, we could try to develop a simple, direct 3D object pose estimation algorithm based only on 2D images. We expect that such a method could be comparable to estimation algorithms based on RGB-D or 3D point clouds.
Finally, I am not well placed to propose the overall implementation plan for this enhancement in CVAT, in terms of either UI design or code changes, but I am willing to do what I can. I sincerely thank CVAT's main contributors for their work and hope you will carefully consider adding this task to the roadmap.
I support the request. We also have a need for such functionality.
This is a growing request from the automotive industry as well; we need cuboid annotations to be done on RGB images, not point clouds.
Yes, you are right. Actually, I wrote a simple 2D head pose annotation tool using PyQt + Mayavi last year. As shown below, the annotator labels a head by adding a bounding box in the 2D image, then adjusts the 3D head model in the right-hand area with the mouse or keyboard until it has a similar orientation/pose to the boxed head. The accompanying 3D cube is projected into the 2D image, and for every resulting head pose we record the corresponding Euler angles.
However, this tool only runs on the desktop and does not fully match what I originally expected (refer to the question above for details). I was then busy with other things, so I did not update or polish it for a long time. If possible, I still look forward to seeing this annotation feature in CVAT.
Seconding this, this would be immensely valuable for pose estimation of objects in robotics.
Hello everyone. For those interested in this issue, you can refer to the 2D head pose annotation tool I mentioned at https://github.com/hnuzhy/HeadAttribute/.
My actions before raising this issue
I have read and searched the official docs and past issues for a solution. No one has had the same problem as me.
Expected Behaviour
I want to annotate the head orientation of people in a 2D image with a standard 3D cube; here, the head is treated as a rigid object. A standard cube is defined as follows: the three edges at any vertex are mutually perpendicular, and all twelve edges are equal (or unit) in length.
After labeling, we can get the eight projected vertices of the cube in the two-dimensional image coordinate system. If three Euler angles (pitch, yaw, roll) are used to represent the orientation of the head, these precise projection points can be converted into the corresponding angles.
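As a concrete sketch of that conversion between angles and orientation, here is one common head-pose convention and its inverse: yaw about the vertical y-axis, pitch about the horizontal x-axis, roll about the viewing z-axis, composed as R = Ry(yaw) @ Rx(pitch) @ Rz(roll). This convention is my assumption for illustration; CVAT does not define one:

```python
import numpy as np

def euler_to_matrix(pitch, yaw, roll):
    # Angles in radians; composition R = Ry(yaw) @ Rx(pitch) @ Rz(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

def matrix_to_euler(R):
    # Inverse of the composition above (valid away from pitch = +/-90 deg)
    pitch = np.arcsin(-R[1, 2])
    yaw = np.arctan2(R[0, 2], R[2, 2])
    roll = np.arctan2(R[1, 0], R[1, 1])
    return pitch, yaw, roll
```

Whatever convention a dataset adopts, recording it explicitly alongside the annotations is essential, since the same matrix maps to different angle triplets under different conventions.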
Current Behaviour
Current cuboid annotation
The cuboid annotation currently provided in CVAT is not suitable for rigid objects. 1) First, it cannot guarantee that the edges at each vertex of the labeled cuboid are mutually perpendicular. 2) Second, the length, width, and height of the cuboid are not necessarily equal. 3) Finally, the side faces of the current cuboid are always vertical; it cannot be rotated, so it lacks a degree of freedom. These limitations make the cuboid unusable for marking head orientation. In addition, I also think such a cuboid is not suitable for labeling cars, chairs, and other rigid objects.
Alternative choice: polyline
As an alternative, I tried to annotate three consecutive, non-coplanar edges of the cube using the polyline label. In this way, the four points of the three edges can be used to estimate the Euler angles. However, this alternative only solves the third problem of the cuboid label mentioned above; the first and second problems remain unsolved, and what we actually get are still oblique cuboids.
Possible Solution
I have three suggestions, or roadmap items, for adding a unit cube label in a new version of CVAT.
1) Improve cuboid
The current cuboid is actually oblique. However, objects in the real world should be marked with regular cuboids in which the three edges at each vertex are mutually perpendicular. At the same time, we need to release the third rotational degree of freedom of the cuboid and allow it to rotate freely. I don't know whether this is easy to implement in TypeScript; Three.js and other open-source packages may serve as references.
2) Modify cuboid-3d
As far as I know, recent versions of CVAT already support 3D point cloud annotation. Would it be possible to transplant the 3D cuboid module to 2D image annotation? I'm not very familiar with the point cloud annotation code, so I cannot offer detailed opinions here.
3) Add cube
If possible, consider adding a new cube label to the candidate label buttons on the left side of CVAT. Users could then add new 3D cube shapes. The cube instance would support rotation at any angle around all three axes, and the software would automatically record the final Euler angles once the cube's pose is fixed.
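The geometry such a cube shape would have to draw can be sketched in a few lines, assuming a scaled-orthographic projection (the function and parameter names here are hypothetical, not part of CVAT):

```python
import numpy as np

def project_unit_cube(R, center, size):
    """Project the 8 corners of a rotated unit cube onto the image plane.

    `R` is a 3x3 rotation matrix, `center` the 2D anchor point in pixels,
    `size` the cube edge length in pixels. Returns an (8, 2) array of
    image points, one per corner.
    """
    # Corners of an axis-aligned unit cube centred at the origin
    corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                                  for y in (-0.5, 0.5)
                                  for z in (-0.5, 0.5)])
    rotated = corners @ R.T                       # rotate in 3D
    # Drop the depth coordinate, scale to pixels, and translate
    return size * rotated[:, :2] + np.asarray(center, dtype=float)
```

The UI would redraw these 8 points (and the 12 edges connecting them) on every mouse or keyboard interaction, and store the Euler angles corresponding to R once the user confirms the pose.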
Here are two examples of 3D model interaction. The first is the rotation interaction of a 3D head model in Mayavi; the interaction relies on both mouse and keyboard. The second uses the 3D image editing tool in Windows 10 to place and manipulate 3D models on 2D images; all you need is the mouse.
Example 1
Example 2
Next steps
Looking forward to your reply. I will be willing to do whatever I can to advance this feature.