EvelynFan / FaceFormer

[CVPR 2022] FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Input when I don't want any template and style embedding #43

Open ujjawalcse opened 1 year ago

ujjawalcse commented 1 year ago

Hey @zlinao @EvelynFan, thanks for this clear code structure. I'm trying to train this model after removing the style embedding layer and without any template.

So, in the forward pass you have:

template = template.unsqueeze(1)          # (1, 1, V*3)
obj_embedding = self.obj_vector(one_hot)  # (1, feature_dim)

And when using teacher_forcing, the input is created like this:

if teacher_forcing:
    vertice_emb = obj_embedding.unsqueeze(1)  # (1, 1, feature_dim)
    style_emb = vertice_emb
    vertice_input = torch.cat((template, vertice[:,:-1]), 1)  # shift one position
    vertice_input = vertice_input - template
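
If I understand correctly, this makes the very first decoder input a zero displacement, because the concatenated template cancels itself in the subtraction. A minimal toy check (small made-up sizes standing in for the real V*3 dimension):

    import torch

    # toy sizes: 1 sequence, 4 frames, 6 coordinates standing in for V*3
    template = torch.randn(1, 1, 6)   # neutral face, (1, 1, V*3)
    vertice = torch.randn(1, 4, 6)    # ground-truth frames, (1, T, V*3)

    vertice_input = torch.cat((template, vertice[:, :-1]), 1)  # shift one position
    vertice_input = vertice_input - template                   # broadcasts over all frames

    print(vertice_input[0, 0])  # all zeros: the first frame is a zero displacement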

and when not using teacher_forcing, the input is created like this:

if i == 0:
    vertice_emb = obj_embedding.unsqueeze(1)  # (1, 1, feature_dim)
    style_emb = vertice_emb
    vertice_input = self.PPE(style_emb)
    print('vertice_input shape:', vertice_input.shape)
else:
    vertice_input = self.PPE(vertice_emb)
# ... (decoder and motion decoder steps omitted) ...
vertice_emb = torch.cat((vertice_emb, new_output), 1)
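
So at inference the embedding sequence grows by one entry per frame, roughly like this simplified loop (a rough sketch: the decoder and motion decoder are replaced by a random stand-in, and PPE by an identity function):

    import torch

    feature_dim = 8
    PPE = lambda x: x  # identity stand-in for the periodic positional encoding

    obj_embedding = torch.randn(1, feature_dim)
    vertice_emb = obj_embedding.unsqueeze(1)         # (1, 1, feature_dim)
    for i in range(3):
        vertice_input = PPE(vertice_emb)             # (1, i+1, feature_dim)
        new_output = torch.randn(1, 1, feature_dim)  # stand-in for the decoded frame embedding
        vertice_emb = torch.cat((vertice_emb, new_output), 1)  # append for the next step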

So, I changed these lines to use a zero vector of the required dimension as the first input:

if teacher_forcing:
    first_input = torch.FloatTensor(np.zeros([1, input_dim])).unsqueeze(1).to(device=self.device)
    vertice_input = torch.cat((first_input, vertice[:,:-1]), 1)  # shift one position

Since I concatenate a zero vector, there is no need to subtract anything, as you did in your case (where you subtracted the template).

Again, when not using teacher forcing, the input looks like this:

if i == 0:
    vertice_emb = torch.FloatTensor(np.zeros([1, feature_dim])).unsqueeze(1).to(device=self.device)
    style_emb = vertice_emb
    vertice_input = self.PPE(style_emb)
else:
    vertice_input = self.PPE(vertice_emb)
# ... (decoder and motion decoder steps omitted) ...
vertice_emb = torch.cat((vertice_emb, new_output), 1)

The whole flow runs, but the training loss quickly plateaus (between 0.0035 and 0.0040) after 2-3 epochs. Also, at inference the hidden states produced are the same for every frame, so the animation is identical across all frames.
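
For reference, this is roughly how I check that the predicted frames are identical (a minimal sketch; prediction is a placeholder for the model output of shape (1, T, V*3)):

    import torch

    prediction = torch.zeros(1, 10, 6)  # placeholder for the model output, (1, T, V*3)
    frame_std = prediction.std(dim=1)   # per-coordinate variation across frames
    print(frame_std.max().item())       # ~0 means every frame is identical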

Please suggest what I'm missing here or anything that needs to be added.

Thanks again.

rina22 commented 1 year ago

Any updates?

xiaodongyichuan commented 1 year ago

I have the same question. Any update?

JSHZT commented 1 year ago

Any update?

shivangi-aneja commented 1 year ago

Also facing the same issue. Did anyone find a fix?

Shirley-0708 commented 10 months ago

@JSHZT @shivangi-aneja @xiaodongyichuan Also facing the same issue. Did you fix this problem?

JSHZT commented 10 months ago

> @JSHZT @shivangi-aneja @xiaodongyichuan Also facing the same issue. Did you fix this problem?

In fact, the modifications described above are not rigorous. I don't agree with concatenating a zero vector, because the entire sequence is expressed relative to a template. In the author's original design, the network learns displacements relative to a specific template, so after concatenating the template and then subtracting it, the first frame is a neutral expression with zero displacement. Concatenating a zero vector instead breaks the logic of the task. The way I achieved a similar goal was to redo the data, which is undoubtedly expensive, but beyond that I can't think of another rigorous method, because in this data and task, identity and style are tightly coupled. Hope this helps!
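
To make this concrete, here is a minimal sketch of what the two variants feed the decoder (toy tensors, not the real data):

    import torch

    template = torch.randn(1, 1, 6)  # neutral face
    vertice = torch.randn(1, 3, 6)   # ground-truth frames

    # original: every input is a displacement relative to the template,
    # and the first frame is exactly zero
    orig_input = torch.cat((template, vertice[:, :-1]), 1) - template

    # zero-vector variant: the first input is zero, but the remaining
    # inputs stay absolute coordinates, so the learning target changes
    zero_input = torch.cat((torch.zeros(1, 1, 6), vertice[:, :-1]), 1)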