Closed. ken881015 closed this issue 2 months ago.
I'm not sure what inaccuracy you are referring to.
For the ground truth, the original AFLW2000-3D annotation indeed contains many errors. Its annotation process (described in the associated CVPR 2016 paper) is automatic rather than manual, with several failure cases like the ones you have shown here. There is a reannotated version of AFLW2000-3D that remedies this, but it is more recent and the convention is to compare on the original one. I think it makes more sense to evaluate on the reannotated version, or on some of the more recent, better-quality facial alignment datasets.
For the model, I agree that there is still some room for improvement. The model tends to produce large errors on highly occluded or very large-pose faces, but that is what we always struggle with.
On Tuesday, June 27, 2023, ken881015 wrote:
Hello, I really appreciate the work you completed. SynergyNet is not only lightweight but also keeps acceptable accuracy on AFLW2000-3D.
Although I'm also one of the trainers who can't reproduce the 3.4% NME (my best is 3.674% after fixing the code problem in https://github.com/choyingw/SynergyNet/issues/18#issuecomment-1600030352) on the original benchmark annotation, I keep trying to analyze which kinds of images the model fails on and how to improve it. So I sorted the NME over the 2000 images, made a grid of the 48 worst images with the model's alignment drawn on them, and show the ground-truth annotation beside each one (for each pair, the left is the model output and the right is the ground truth, using the reannotated version). [image: grid_of_worst_alignment_0~47_re_v2_fix_loss_problem_80] https://user-images.githubusercontent.com/38501223/249102379-3799a5e7-5439-4466-9bf8-73dfd3703a3e.png
As you can see, some of the ground-truth annotations are not accurate: (with indices starting from 1) pairs (1,1) and (1,2) obviously have annotations that are not worth using as a reference, while other pairs (e.g. (8,6)) show that the large NME comes from model performance rather than inaccurate annotation, i.e. there is still room for improvement.
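(To make the ranking procedure above concrete, here is a minimal sketch of how one might compute per-image NME and pick the worst cases. `pred68`, `gt68`, and `bbox` are hypothetical arrays, and the sqrt(w*h) normalization is the common AFLW2000-3D convention, not code taken from this repo.)

```python
import numpy as np

def per_image_nme(pred68, gt68, bbox):
    """NME per image: mean landmark error normalized by sqrt(box area).

    pred68, gt68: (N, 2, 68) predicted / ground-truth 2D landmarks
    bbox:         (N, 4) boxes as [x1, y1, x2, y2]
    """
    err = np.linalg.norm(pred68 - gt68, axis=1).mean(axis=1)  # (N,)
    w = bbox[:, 2] - bbox[:, 0]
    h = bbox[:, 3] - bbox[:, 1]
    return err / np.sqrt(w * h)

# nme = per_image_nme(pred68, gt68, bbox)   # (N,) per-image NME
# worst48 = np.argsort(nme)[::-1][:48]      # indices of the 48 worst images
```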
Below is part of my code that post-processes the files (roi_box, pts68, ...) you offer in the repo and visualizes the alignment on the image. Regarding the inaccuracy problem, did I do anything wrong, or is there any opinion you can share with us? I would really appreciate it.
Put this code in ./aflw2000_data/ and you can run it:

```python
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

# you can select by image name
img_name = "image02156.jpg"
img = plt.imread("./AFLW2000-3D_crop/" + img_name)

# choose the version of benchmark annotation (original or reannotated)
pts68 = np.load("./eval/AFLW2000-3D.pts68.npy")
pts68 = np.load("./eval/AFLW2000-3D-Reannotated.pts68.npy")

bbox = np.load("./eval/AFLW2000-3D_crop.roi_box.npy")
fname_list = Path("./AFLW2000-3D_crop.list").read_text().strip().split('\n')

# map landmarks from original image coordinates into the 120x120 crop
pts68[:, 0, :] = (pts68[:, 0, :] - bbox[:, [0]]) / (bbox[:, [2]] - bbox[:, [0]]) * 120
pts68[:, 1, :] = (pts68[:, 1, :] - bbox[:, [1]]) / (bbox[:, [3]] - bbox[:, [1]]) * 120

fig, ax = plt.subplots()
# plot image
ax.imshow(img)
# scatter landmarks
idx = fname_list.index(img_name)
ax.scatter(pts68[idx, 0, :], pts68[idx, 1, :])
fig.savefig("alignment.jpg")
```
Thanks for your clear visualization. As far as I know, annotating 3D landmarks is very challenging; some recent datasets such as the NoW benchmark or DAD-3DHeads (https://www.pinatafarm.com/research/dad-3dheads) may be a better choice. Otherwise, you can manually filter out the bad annotations in the AFLW2000 reannotation, which is reasonable in my opinion.
Random erasing, in my opinion, may help in some limited cases such as cropped-out faces, but other occlusion types such as hands or scarves are hard since the occlusion shape is irregular. An easy hack is to add those erased images to the training set and see if the trick helps in the cropped-out-face cases.
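(As a side note, a minimal sketch of the random-erasing idea using torchvision's RandomErasing; the probability and area range below are illustrative guesses, not SynergyNet's actual augmentation settings.)

```python
import torch
from torchvision import transforms

# RandomErasing operates on CHW tensors, so in a typical pipeline it is
# placed after ToTensor(); p and scale here are illustrative values.
erase = transforms.RandomErasing(p=0.5, scale=(0.02, 0.2), ratio=(0.3, 3.3))

# hypothetical 120x120 RGB crop already converted to a tensor
crop = torch.rand(3, 120, 120)
augmented = erase(crop)   # a random rectangle is blanked out with prob. p
```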
About the NME, I'm not sure what causes this phenomenon (maybe the learning-rate change points), but I think it is reasonable that the best NME happens after the milestones, since the milestones indicate when to decay the learning rate, and a lower LR indicates better convergence. Something we have in mind (but have not fully tested yet) is that a lower final LR and more training epochs may help attain better minima.
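(For reference, a minimal sketch of the milestone schedule being discussed, using PyTorch's MultiStepLR. Only the milestones (48, 64) come from the defaults mentioned above; the base LR, decay factor, and epoch count are illustrative.)

```python
import torch

model = torch.nn.Linear(10, 10)                            # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.08)

# LR is multiplied by gamma at each milestone epoch, so after epoch 64 the
# LR becomes 0.08 * 0.1 * 0.1 = 0.0008 for the remaining epochs; lowering
# gamma or adding a later milestone plus more epochs is the variant above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[48, 64], gamma=0.1)

for epoch in range(80):
    # ... one training epoch (forward, backward, optimizer.step()) goes here ...
    optimizer.step()    # placeholder so the scheduler has an optimizer step to follow
    scheduler.step()
```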
On Tue, Jun 27, 2023 at 11:40 PM ken881015 wrote:
- Thanks for the reply.
- The picture I showed is the reannotated version of the benchmark. However, what surprises me is that it still has some bad annotations (such as pairs (1,1) and (1,2)). So maybe I will find a new face-alignment dataset for validation. Thank you for your suggestions.
- Regarding occlusion, I am currently trying to add an augmentation technique that randomly erases parts of the input image to improve the model's ability to handle occlusions.
- Lastly, while tuning parameters and fixing some issues in the code, I recorded the NME (Normalized Mean Error) throughout the training process. I have a few questions that I would like to ask you (NME curves: https://user-images.githubusercontent.com/38501223/249372723-b800dd1c-1747-4e2c-af14-2bce6d9a0003.png):
- Coincidentally, between epochs 25 and 50 almost all of the runs show a hill-shaped curve.
- Surprisingly, the best NME in each run happened after the milestones (default: 48, 64).
- Do you think this phenomenon is explainable, or is it just heuristic?
For AFLW2000-3D, if there is some remedy for out-of-distribution cases (occlusion, underwater, or very large poses), the NME can still be improved, since those cases can greatly drag down the overall performance. Most previous facial-landmark methods focus on learning a good representation for faces but do not specifically incorporate priors for OOD data. Happy to discuss more via email.