deepfakes / faceswap-model

Tweaking the generative model

Why do two autoencoders need to be trained? #26

Open coldstacks opened 6 years ago

coldstacks commented 6 years ago

In faceswap's training step, two autoencoders (A and B) are trained, one for each of the following tasks:

input face A -> encoder -> base vector A -> decoder A -> output face which resembles A
input face B -> encoder -> base vector B -> decoder B -> output face which resembles B

Then, decoder B is applied to image A, which is what actually performs the swap:

input face A -> encoder -> base vector A -> decoder B -> output face which resembles B

(If this is wrong, please correct me!)
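To make sure I'm reading the code right, here's a minimal Keras sketch of how I currently picture the wiring. The layer sizes, block definitions (plain UpSampling2D instead of the project's upscale), and variable names are my own simplification, not the actual Model.py code:

```python
from keras.layers import (Input, Dense, Flatten, Reshape, Conv2D,
                          LeakyReLU, UpSampling2D)
from keras.models import Model
from keras.optimizers import Adam

IMG_SHAPE = (64, 64, 3)  # assuming 64x64 RGB face crops

def conv_block(filters):
    def block(x):
        x = Conv2D(filters, kernel_size=5, strides=2, padding='same')(x)
        return LeakyReLU(0.1)(x)        # LeakyReLU throughout the encoder
    return block

def upscale_block(filters):
    def block(x):
        x = UpSampling2D()(x)
        x = Conv2D(filters, kernel_size=3, padding='same')(x)
        return LeakyReLU(0.1)(x)
    return block

def build_encoder():
    inp = Input(shape=IMG_SHAPE)
    x = conv_block(128)(inp)
    x = conv_block(256)(x)
    x = conv_block(512)(x)              # 64x64 -> 8x8 after three stride-2 convs
    x = Flatten()(x)
    x = Dense(1024)(x)                  # the shared "base vector"
    x = Dense(8 * 8 * 512)(x)
    x = Reshape((8, 8, 512))(x)
    return Model(inp, x)

def build_decoder():
    inp = Input(shape=(8, 8, 512))
    x = upscale_block(256)(inp)
    x = upscale_block(128)(x)
    x = upscale_block(64)(x)            # back up to 64x64
    # final layer maps back to pixel space; sigmoid keeps outputs in [0, 1]
    out = Conv2D(3, kernel_size=5, padding='same', activation='sigmoid')(x)
    return Model(inp, out)

encoder = build_encoder()     # ONE encoder, shared by both autoencoders
decoder_A = build_decoder()   # decoder specialized on person A
decoder_B = build_decoder()   # decoder specialized on person B

face = Input(shape=IMG_SHAPE)
autoencoder_A = Model(face, decoder_A(encoder(face)))
autoencoder_B = Model(face, decoder_B(encoder(face)))
autoencoder_A.compile(optimizer=Adam(), loss='mean_absolute_error')
autoencoder_B.compile(optimizer=Adam(), loss='mean_absolute_error')

# the swap at conversion time: encode a face of A, decode with B's decoder
# swapped = decoder_B.predict(encoder.predict(face_A))
```

If that's roughly right, then both autoencoders update the same encoder weights and only the decoders are separate, which is what prompts the questions below.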

First question: what does it mean for there to be only a single encoder? Since decoder A is not used in the final conversion, I assume the reason why training autoencoder A is necessary is because it contributes to the encoder. (Is this correct? Does decoder A need to be trained simply because without the decoder, there's no way to define autoencoder A's loss function?) Should I think of the encoder as something that transforms inputs into some kind of shared low-dimensional representation of both A and B...? As you can probably tell from this wording, I'm a bit stuck on this point...

Second question: since there's only a single encoder, does this mean that, during training, I should care about both loss_A and loss_B values (even though I'm only interested in swapping B to A)?

Third question: does it matter which autoencoder is trained first, during each training step? In plugins/Model_Original/Trainer.py:

```python
loss_A = self.model.autoencoder_A.train_on_batch(warped_A, target_A)
loss_B = self.model.autoencoder_B.train_on_batch(warped_B, target_B)
```

If these two lines were reversed, would autoencoder_A and autoencoder_B end up in the same state after the training step?
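To make the third question concrete, this is the kind of check I was thinking of running, using the names from my sketch above. It restores weights but not the Adam optimizer state, so it's only an approximate experiment:

```python
import numpy as np

submodels = (encoder, decoder_A, decoder_B)   # names from the sketch above

def snapshot():
    # copy every weight array: the shared encoder plus both decoders
    return [w.copy() for m in submodels for w in m.get_weights()]

def restore(saved):
    i = 0
    for m in submodels:
        n = len(m.get_weights())
        m.set_weights(saved[i:i + n])
        i += n

start = snapshot()

# order 1: A then B, as Trainer.py does today
autoencoder_A.train_on_batch(warped_A, target_A)
autoencoder_B.train_on_batch(warped_B, target_B)
after_ab = snapshot()

# order 2: B then A, starting from the same weights
restore(start)
autoencoder_B.train_on_batch(warped_B, target_B)
autoencoder_A.train_on_batch(warped_A, target_A)
after_ba = snapshot()

print("same state after one step:",
      all(np.allclose(a, b) for a, b in zip(after_ab, after_ba)))
```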

Fourth question: I noticed in plugins/Model_Original/Model.py's Decoder function, there's a single Conv2D invoked with activation='sigmoid'. Is there a reason that sigmoid is used here? I see that the encoder uses LeakyReLU instead; why is that?

Sorry for the wall of text; hopefully this is the right place to ask questions like this. If not, could someone point me in the right direction?

torzdf commented 6 years ago

I'm not sure if this will answer your question, but @Clorr gave an explanation here: https://github.com/deepfakes/faceswap/issues/229

coldstacks commented 6 years ago

Thanks @torzdf. I'm asking because I'm thinking of changing some of these things to see whether it improves results. Since training these models takes significant computational power, I'm trying to understand why things are currently set up the way they are, so that I can tinker intelligently.

@Clorr are you the original developer who wrote the core Trainer.py and Model.py code mentioned above? It would be really helpful if you could tell me whether any of the above choices were essentially arbitrary, or whether they were made for strong theoretical reasons (and/or because there was strong empirical evidence that they worked best). From that other issue, it sounds like the answer to the first question (a single shared encoder rather than two separate ones) is that the shared encoder empirically works best, in terms of giving you a model that's generic.