NVlabs / few_shot_gaze

Pytorch implementation and demo of FAZE: Few-Shot Adaptive Gaze Estimation (ICCV 2019, oral)
https://research.nvidia.com/publication/2019-10_Few-Shot-Adaptive-Gaze

Predicted point of regard ~10x bigger on demo #20

Open rogeriochaves opened 3 years ago

rogeriochaves commented 3 years ago

Hello there!

For some reason the predicted PoR is way off screen. To debug it, I ran the person calibration again on an already-trained network, then saved the gaze_n_vector variable used during training and the g_cnn variable used during prediction in frame_processor.py. If I plot them separately, I get this:

[image: gaze_n_vector and g_cnn plotted separately]

Leaving the clear error aside, if I plot them together I get this:

[image: both variables plotted together]

Now, if I fit a linear regression, I get a coefficient of almost exactly 0.1 for both:

[image: linear regression fit, coefficient ≈ 0.1]

Applying those coefficients, I get a prediction that makes more sense:

[image: corrected prediction]

Why is that? Is some part of the calculation missing during prediction in frame_processor.py? Why is the PoR always 10x bigger?
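For reference, the rescaling experiment described above can be sketched like this; the arrays are synthetic stand-ins for the saved gaze_n_vector and g_cnn values, not data from the repo:

```python
import numpy as np

# Synthetic stand-ins for the saved arrays: g_cnn behaves as if it were
# roughly 10x the calibration-time gaze values, as observed above.
rng = np.random.default_rng(0)
gaze_n_vector = rng.uniform(-0.3, 0.3, size=200)
g_cnn = 10.0 * gaze_n_vector + rng.normal(0.0, 0.005, size=200)

# Least-squares fit: gaze_n_vector ~ coef * g_cnn + intercept
coef, intercept = np.polyfit(g_cnn, gaze_n_vector, 1)
print(f"coef ~= {coef:.3f}")  # close to 0.1

# Applying the fitted coefficients rescales the predictions into range
corrected = coef * g_cnn + intercept
```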

shalinidemello commented 3 years ago

Did you calibrate your camera to obtain its intrinsic parameters and, more importantly, the extrinsic parameters (rotation, translation) between your camera and your monitor, as described in step 2b of https://github.com/NVlabs/few_shot_gaze/blob/master/demo/README.md?

rogeriochaves commented 3 years ago

Thanks for replying! No, to be honest I skipped this step at first, but I have now tried to follow it through again. I couldn't make the newer Ver. 2 work, with either the Matlab version (I'm using Octave) or the C version. I could make the Ver. 1 Matlab version work (I think?), but I don't know what to do with the outputs:

Average reprojection error by TNM : 0.169550 pixel.

==== Parameters by TNM ====
R =

  -0.914077  -0.030128   0.404420
  -0.020261   0.999384   0.028656
  -0.405034   0.018000  -0.914124

T =

   285.23
   100.70
   159.83

n1 =

   0.333519
   0.048892
  -0.941475

n2 =

   0.484883
   0.098207
  -0.869048

n3 =

   0.047996
   0.011300
  -0.998784

d1 = 351.78
d2 = 318.01
d3 = 377.73
points =

   285.234   100.701   159.834
   125.271    97.155    88.953
   282.221   200.639   161.634
   285.234   100.701   159.834

points =

    84.2177    71.2328   727.2730     1.0000
   -84.5563    66.3955   681.2629     1.0000
    79.7462   170.9574   733.1905     1.0000
    84.2177    71.2328   727.2730     1.0000

points =

   -32.1676    36.4152   728.7079     1.0000
  -176.3115    36.0734   629.4738     1.0000
   -41.7646   135.0200   742.3086     1.0000
   -32.1676    36.4152   728.7079     1.0000

points =

   262.8758    95.4368   625.1049     1.0000
    96.8575    90.4655   680.2247     1.0000
   259.9411   195.3936   625.2806     1.0000
   262.8758    95.4368   625.1049     1.0000

points =

   184.726    85.967   443.553
    20.357    81.775   385.108
    16.615   181.607   388.967
   180.984   185.798   447.412
   184.726    85.967   443.553

points =

   126.533    68.558   444.271
   -25.520    66.614   359.213
   -31.825   165.886   366.914
   120.228   167.830   451.971
   126.533    68.558   444.271

points =

   274.055    98.069   392.469
   111.064    93.810   384.589
   108.090   193.758   385.577
   271.081   198.016   393.457
   274.055    98.069   392.469

How should I tweak monitor.py based on this? Can you give me an example? It's a lot of numbers, and the documentation is not clear.

FYI, I'm using a MacBook Pro webcam, if that makes things simpler.

swook commented 3 years ago

I am not 100% sure, but I imagine that you need to update https://github.com/NVlabs/few_shot_gaze/blob/master/demo/monitor.py#L28 (and its inverse) based on these values that you determined:

T =

   285.23
   100.70
   159.83

I imagine that this is the translation from the screen to the camera coordinate system, in millimeters.

So for example, you could probably define (yet again, I'm not 100% sure):

    def monitor_to_camera(self, x_pixel, y_pixel):

        x_cam_mm = 285.23 + ((int(self.w_pixels/2) - x_pixel)/self.w_pixels) * self.w_mm
        y_cam_mm = 100.7 + (y_pixel/self.h_pixels) * self.h_mm
        z_cam_mm = 159.83

        return x_cam_mm, y_cam_mm, z_cam_mm

and a corresponding camera_to_monitor
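For illustration, the inverse mapping could look like the following (same not-100%-sure caveat as above; the screen-size attributes w_pixels/h_pixels and w_mm/h_mm are assumed from the repo's monitor.py, and the 285.23/100.70/159.83 mm offsets are the T values determined above):

```python
class Monitor:
    """Hypothetical sketch pairing monitor_to_camera with its inverse."""

    def __init__(self, w_pixels, h_pixels, w_mm, h_mm):
        self.w_pixels, self.h_pixels = w_pixels, h_pixels
        self.w_mm, self.h_mm = w_mm, h_mm

    def monitor_to_camera(self, x_pixel, y_pixel):
        # Same form as the snippet above: offset by T, scale pixels to mm
        x_cam_mm = 285.23 + ((int(self.w_pixels / 2) - x_pixel) / self.w_pixels) * self.w_mm
        y_cam_mm = 100.70 + (y_pixel / self.h_pixels) * self.h_mm
        z_cam_mm = 159.83
        return x_cam_mm, y_cam_mm, z_cam_mm

    def camera_to_monitor(self, x_cam_mm, y_cam_mm):
        # Solve the two linear equations above for x_pixel and y_pixel
        x_pixel = int(self.w_pixels / 2) - (x_cam_mm - 285.23) * self.w_pixels / self.w_mm
        y_pixel = (y_cam_mm - 100.70) * self.h_pixels / self.h_mm
        return x_pixel, y_pixel
```

The two functions round-trip by construction, which is an easy sanity check before wiring them into the demo.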

zwfcrazy commented 3 years ago

> (quoting @swook's monitor_to_camera suggestion above)

Not exactly. The code in the following places:

https://github.com/NVlabs/few_shot_gaze/blob/2b0ea42ecba456ede03a60c11a94dd62a45dc287/demo/frame_processor.py#L178
https://github.com/NVlabs/few_shot_gaze/blob/2b0ea42ecba456ede03a60c11a94dd62a45dc287/demo/frame_processor.py#L218
https://github.com/NVlabs/few_shot_gaze/blob/2b0ea42ecba456ede03a60c11a94dd62a45dc287/demo/monitor.py#L28
https://github.com/NVlabs/few_shot_gaze/blob/2b0ea42ecba456ede03a60c11a94dd62a45dc287/demo/monitor.py#L38

assumes that the z axis of the camera and the z axis of the monitor are parallel, and that there is no translation in the z direction, i.e. z = 0. However, from the R and T given by @rogeriochaves, it can be seen that neither assumption holds. To apply the calibration results correctly, you need to:

  1. apply a full coordinate transformation in monitor.py, using not only the translation vector T but also the rotation matrix R.
  2. change the way the PoR is calculated: instead of assuming z = 0, find the intersection between the gaze vector and the monitor plane (usually the xy plane of the monitor).

BTW, the R and T given by the calibration process actually describe the relationship between the chessboard pattern displayed on the monitor and the camera, which may not be the same as the relationship between the monitor and the camera. You need to find the relationship between the chessboard pattern and the monitor as well.
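For illustration, step 2 (intersecting the gaze ray with the monitor plane) could be sketched as below. All names are illustrative, and the convention that a monitor-frame point maps to the camera frame as p_cam = R @ p_mon + T is my assumption, not code from the repo:

```python
import numpy as np

def intersect_gaze_with_monitor(origin_cam, gaze_dir_cam, R, T):
    """Return the PoR in monitor coordinates, without assuming z = 0.

    origin_cam: gaze origin (e.g. eye center) in camera coordinates, mm.
    gaze_dir_cam: gaze direction in camera coordinates.
    R, T: monitor-to-camera rotation matrix and translation vector.
    """
    R = np.asarray(R, dtype=float)
    T = np.asarray(T, dtype=float).reshape(3)
    o = np.asarray(origin_cam, dtype=float)
    d = np.asarray(gaze_dir_cam, dtype=float)

    # The monitor's xy-plane has normal R @ [0, 0, 1] in camera
    # coordinates and passes through T (monitor origin in camera frame).
    n = R @ np.array([0.0, 0.0, 1.0])

    # Ray-plane intersection: solve n . (o + t*d - T) = 0 for t
    t = np.dot(n, T - o) / np.dot(n, d)
    por_cam = o + t * d

    # Map back into monitor coordinates (inverse rigid transform);
    # por_mon[2] is ~0 by construction.
    por_mon = R.T @ (por_cam - T)
    return por_mon
```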

ShreshthSaxena commented 3 years ago

> (quoting @swook's suggestion and @zwfcrazy's reply above)

So is the TNM monitor calibration needed for a default laptop webcam, where the assumptions of z = 0 and Δy = 10 mm fit? I've got the model running, but I'm wondering if there's some way to improve accuracy further through calibration.

zwfcrazy commented 3 years ago

> so is the TNM monitor calibration needed for a default laptop webcam (the assumptions of z = 0 and Δy = 10 mm fit)? I've got the model to run but I'm wondering if there's some way to improve accuracy further by calibration?

Every laptop hardware configuration is different, but the assumption of z = 0 should be fine to use. You do, however, need to at least measure Δy and Δx with a ruler if you really don't want to do the calibration. (Δy is the distance between the camera and the upper edge of the monitor; Δx is the distance between the camera and the left edge of the monitor, usually equal to half the monitor width.)
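Concretely, under z = 0 the ruler-measured offsets would slot in roughly like this. This is a sketch with my own sign conventions (pixel origin at the screen's top-left, camera dy_mm above the top edge and dx_mm right of the left edge), not the repo's exact code:

```python
def monitor_to_camera(x_pixel, y_pixel, w_pixels, h_pixels,
                      w_mm, h_mm, dx_mm, dy_mm):
    # Scale pixels to millimeters, then shift into the camera frame;
    # z = 0 because the camera sits in the screen plane on a laptop.
    x_cam_mm = (x_pixel / w_pixels) * w_mm - dx_mm
    y_cam_mm = (y_pixel / h_pixels) * h_mm + dy_mm
    return x_cam_mm, y_cam_mm, 0.0

def camera_to_monitor(x_cam_mm, y_cam_mm, w_pixels, h_pixels,
                      w_mm, h_mm, dx_mm, dy_mm):
    # Inverse of the mapping above
    x_pixel = (x_cam_mm + dx_mm) / w_mm * w_pixels
    y_pixel = (y_cam_mm - dy_mm) / h_mm * h_pixels
    return x_pixel, y_pixel
```

For a centered webcam, dx_mm would typically be about w_mm / 2, and dy_mm is the measured Δy (e.g. ~10 mm).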

However, a good calibration won't help improve accuracy much in this case. I believe the accuracy is limited by the image resolution: in an experiment, I found that the eye movement between images captured for two target points less than 2 cm apart on the screen is almost impossible to recognize. Increasing the image resolution might be a solution, but that would also increase the complexity of the neural network, and you would need to build a high-resolution training dataset as well. So I think this remains an open problem.