erkil1452 / gaze360

Code for the Gaze360: Physically Unconstrained Gaze Estimation in the Wild Dataset
http://gaze360.csail.mit.edu

Some questions about generating ground truth #42

Closed jxncyym closed 2 years ago

jxncyym commented 2 years ago

@erkil1452 hello, I have some questions:

  1. In the paper you write: "We compute the gaze vector in the Ladybug coordinate system as a simple difference gL = pt − pe." What does pe represent: the 3D coordinate of the right eye or of the left eye?

  2. You describe the process of obtaining the target's 3D coordinate as: "We use the original AprilTag library to detect the marker in each of the camera views and estimate its 3D pose using the known camera calibration parameters and marker size. We then use the pose and known board geometry to find the 3D location of the target cross pt." I understand that AprilTag gives the 2D coordinates of the marker, but how do you get the target's 3D coordinate from that? Could you describe the process in detail, or walk through an example (e.g. the detected marker is at pixel (20, 50), the marker is 20 pixels wide, ...)?

  3. In the paper you use 7 frames to estimate the gaze of the middle frame. Did you evaluate the performance when using 5 or 3 frames?

  4. I noticed a new gaze dataset, ETH-XGaze. They collect the data with 2D cameras, but I could not find how they obtain the ground truth. Do you know how they get the gaze labels?

erkil1452 commented 2 years ago
  1. It is the cyclopean eye (the mean of both eyes). We do not detect the eyes explicitly, so the point is estimated from the AlphaPose skeleton.
  2. The tag has a known physical size (and it is relatively large), so we can recover its 3D position from its apparent size in the image. This assumes we know the camera intrinsics, the pixel coordinates of the marker's corners (and therefore the 3D view rays), and the physical size of the marker. From there we find the 3D rotation and translation of the marker that fit the size and shape constraints (see the sketch after this list).
  3. We use 7 frames. I believe we tested fewer as well, but I do not see the results anywhere; presumably they were at least slightly worse. The MSE Static row in Table 2 shows the 1-frame case.
  4. It seems they simply ask people to look at a target that they choose for them, so they know where each person is looking: https://ait.ethz.ch/projects/2020/ETH-XGaze/ . They also know the size of the screen, and judging by the head rest, they can also control the position of the participant. So they have all the 3D points under control.
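
For anyone following along, here is a minimal sketch of the idea behind answers 1 and 2, not the actual Gaze360 pipeline: the tag size, intrinsics, corner pixels and target offset below are made-up values, and `cv2.solvePnP` stands in for whatever pose solver the AprilTag library uses internally.

```python
import cv2
import numpy as np

# --- Answer 2: marker pose from one camera view ---
# A square AprilTag of known side length; its corners in the tag's own frame
# (z = 0 plane), ordered as required by SOLVEPNP_IPPE_SQUARE.
tag_size = 0.16  # physical side length in metres (made-up value)
object_pts = 0.5 * tag_size * np.array([
    [-1.0,  1.0, 0.0],   # top-left
    [ 1.0,  1.0, 0.0],   # top-right
    [ 1.0, -1.0, 0.0],   # bottom-right
    [-1.0, -1.0, 0.0],   # bottom-left
])
# Corner pixels reported by the AprilTag detector (made-up values).
image_pts = np.array([
    [412.3, 290.1],
    [538.7, 288.4],
    [540.2, 415.9],
    [410.8, 417.5],
])
# Camera intrinsics and distortion from calibration (made-up values).
K = np.array([[1400.0,    0.0, 640.0],
              [   0.0, 1400.0, 480.0],
              [   0.0,    0.0,   1.0]])
dist = np.zeros(5)

# Known corner geometry + intrinsics -> 3D pose of the tag in camera coordinates.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_IPPE_SQUARE)
R, _ = cv2.Rodrigues(rvec)

# Known board geometry: offset of the target cross from the tag centre,
# expressed in the tag's frame (made-up value). Transforming it with the
# recovered pose gives the target's 3D position p_t in camera coordinates.
target_offset_tag = np.array([0.0, -0.25, 0.0])
p_t = R @ target_offset_tag + tvec.ravel()

# --- Answer 1: gaze vector ---
# p_e is the cyclopean eye: the mean of the two 3D eye positions (made-up
# values here), expressed in the same coordinate system as p_t.
left_eye = np.array([0.13, -0.05, 1.80])
right_eye = np.array([0.07, -0.05, 1.80])
p_e = 0.5 * (left_eye + right_eye)

g = p_t - p_e                      # gaze vector gL = pt - pe
g_unit = g / np.linalg.norm(g)     # unit gaze direction
```

In the dataset itself pe comes from the multi-camera Ladybug setup rather than a single hand-picked point; the block only illustrates how pt and the difference pt − pe fit together.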
jxncyym commented 2 years ago

@erkil1452 thank you for your reply.

  1. About the fourth question, I want to confirm my understanding of how the ground truth is obtained. My understanding is: because the data are collected in a lab, if we pick a point (say, the top-left corner of the screen) as the origin of the world coordinate system, then every 3D world coordinate in the lab is known (including every point on the screen). They can then use Zhang's calibration method to get a rotation and translation matrix for every camera, use those matrices to convert the 3D world coordinates into camera coordinates, and use the camera coordinates to compute the ground truth. Is my understanding correct?
  2. Another question: once a camera is calibrated (say, with Zhang's method), do the rotation and translation matrices stay the same in any environment?
  3. Why do you use head images instead of face images to train the model? Did you evaluate the performance of using face images?
erkil1452 commented 2 years ago
  1. You can still use a known camera position (from a previous calibration) as the origin. Subtracting the head position from the position of the gaze target on the screen gives the gaze vector (see the sketch after this list).
  2. The intrinsic matrix does not change when you move the camera. The extrinsic matrix (rotation and translation) generally does change.
  3. In a 360-degree setup the face is often not visible, and we only tested the method using head crops.
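
To make answers 1 and 2 concrete, here is a minimal sketch of the screen-target setup under discussion, not code from Gaze360 or ETH-XGaze; the extrinsics and 3D points below are placeholder values.

```python
import numpy as np

# World-to-camera extrinsics from a prior calibration (placeholder values).
R = np.eye(3)                  # rotation
t = np.array([0.0, 0.0, 0.0])  # translation in metres

def world_to_camera(p_world: np.ndarray) -> np.ndarray:
    """Apply the extrinsic transform: p_cam = R @ p_world + t."""
    return R @ p_world + t

# 3D position of the on-screen gaze target in world coordinates, known from
# the screen geometry relative to the world origin (placeholder value).
target_world = np.array([0.20, 0.15, 0.00])
# 3D head/eye position in world coordinates, e.g. fixed by a head rest
# (placeholder value).
head_world = np.array([0.00, 0.00, 0.60])

# Gaze label: target minus head. If the label should live in camera
# coordinates, transform both points first; note the translation t cancels
# in the difference, so only the rotation affects the direction.
gaze_cam = world_to_camera(target_world) - world_to_camera(head_world)
gaze_cam /= np.linalg.norm(gaze_cam)
```

Only the extrinsics need to be redone if the camera is moved; the intrinsic matrix stays the same, as noted in answer 2.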
jxncyym commented 2 years ago

@erkil1452 I'm very sorry, I still have not fully understood what you said. Since I am new to this, could you describe the process in more detail? My guess is: first we fix the camera and the screen, then we calibrate the camera to get the rotation and translation matrix. Do you mean we use the camera position as the origin of the world coordinate system, use the distance from the camera to the gaze target to compute the gaze target's world coordinate, compute the head position's world coordinate in the same way, and then subtract the head position from the gaze target to get the gaze vector, finally using the rotation and translation matrix to convert the gaze vector into the camera coordinate system? Is that right? If not, could you describe the process in detail?

erkil1452 commented 2 years ago

Yes, it is as you say.

jxncyym commented 2 years ago

thank you very much