StijnVerdenius / DeepFaceImageSynthesis

This project considers the problem of GAN-generated, photorealistic, personalised facial models that transfer a person's (source) facial expression movements to another person's (target) face for video generation.

Meeting 04-06-2019 with Minh #1

Closed: Ignotus closed this issue 5 years ago

Ignotus commented 5 years ago

When do you guys want to meet? I may have a meeting from 3pm to 4pm.

StijnVerdenius commented 5 years ago

We can do either 13:00-15:00 or 16:00-18:00?

Ignotus commented 5 years ago

13:00-14:00 works for me.

EliasKassapis commented 5 years ago

Questions and Answers

Q: Training will take fairly long and our laptop-GPUs are not really made for that. What’s the proposed solution for getting access to GPUs?

A: You can apply for DAS4, I believe (maybe contact Dennis Koelma for that: koelma@uva.nl). You can also request GPU resources from Surfsara: https://userinfo.surfsara.nl/systems/lisa/account.

Q: Few-Shot seems to be doing pretty much what we want. What is different in our project?

A: Conceptually it's pretty much the same, yes. But let's have a baseline first, and then we can explore different ideas.

Q: Should we adapt it to U-net?

A: U-net can be a later step. It usually trains more slowly, but produces sharper results.

Q: Can we use instance normalization with a U-net architecture?

A: You can.
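
For reference, a minimal PyTorch sketch of what U-net down/up blocks with instance normalization could look like (module names and layer sizes here are illustrative, not from our codebase):

```python
import torch
import torch.nn as nn

class UNetDown(nn.Module):
    """Downsampling block: strided conv + instance norm + LeakyReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class UNetUp(nn.Module):
    """Upsampling block: transposed conv + instance norm + ReLU, then skip concat."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.block(x)
        return torch.cat([x, skip], dim=1)  # skip connection from the matching down block
```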

Q: Are we using triple consistency loss? Or formulating our own loss using the three papers?

A: I would suggest having a pix2pix conditional GAN pipeline first. Those papers are just inspiration to help you decide which direction you may go. Just keep it simple at the beginning.
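
To keep that baseline concrete, a rough sketch of a pix2pix-style conditional GAN objective in PyTorch (the `generator`, `discriminator` and `lambda_l1` below are placeholders, not decisions we have made yet):

```python
import torch
import torch.nn as nn

adv_criterion = nn.BCEWithLogitsLoss()  # adversarial loss on discriminator logits
l1_criterion = nn.L1Loss()              # reconstruction term of pix2pix
lambda_l1 = 100.0                       # L1 weight used in the pix2pix paper

def generator_loss(generator, discriminator, source, target):
    """Conditional GAN loss: fool D on (source, fake) pairs + L1 to the target."""
    fake = generator(source)
    pred_fake = discriminator(torch.cat([source, fake], dim=1))
    adv = adv_criterion(pred_fake, torch.ones_like(pred_fake))
    rec = l1_criterion(fake, target)
    return adv + lambda_l1 * rec

def discriminator_loss(generator, discriminator, source, target):
    """D should classify (source, target) as real and (source, G(source)) as fake."""
    with torch.no_grad():
        fake = generator(source)
    pred_real = discriminator(torch.cat([source, target], dim=1))
    pred_fake = discriminator(torch.cat([source, fake], dim=1))
    real_loss = adv_criterion(pred_real, torch.ones_like(pred_real))
    fake_loss = adv_criterion(pred_fake, torch.zeros_like(pred_fake))
    return 0.5 * (real_loss + fake_loss)
```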

Q: In the paper you shared (Few-Shot), how exactly are the projected embeddings used by the generator (just checking we understood it correctly)?

A: See Formula (1). The embedder is like a feature extractor for images. If you have multiple frames, the features are averaged. And since the output of the embedder is significantly smaller than its input, it will try to compress and preserve only the most valuable information.
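
For our own notes, a minimal sketch of that averaging step, assuming a hypothetical `embedder` module that maps a (frame, landmark image) pair to an embedding vector:

```python
import torch

def average_embedding(embedder, frames, landmarks):
    """Average per-frame embeddings over the K available frames of one identity.

    frames, landmarks: tensors of shape (K, C, H, W).
    Returns a single embedding vector of shape (D,).
    """
    embeddings = [embedder(f.unsqueeze(0), l.unsqueeze(0))  # (1, D) per frame
                  for f, l in zip(frames, landmarks)]
    return torch.stack(embeddings, dim=0).mean(dim=0).squeeze(0)
```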

Q: For the embedder (Few-Shot), are landmarks taken into account, and if so, how?

A: Section 3.1: landmarks are rasterized into three-channel images, using a predefined set of colors to connect certain landmarks with line segments.
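
A rough sketch of what such a rasterization could look like with OpenCV (the landmark grouping and colours below are illustrative, not the exact scheme from the paper):

```python
import cv2
import numpy as np

def rasterize_landmarks(landmarks, size=(256, 256)):
    """Draw landmark groups as colored line segments on a blank 3-channel image.

    landmarks: dict mapping group name -> array of (x, y) points,
               e.g. obtained from a facial landmark detector.
    """
    colors = {               # one predefined BGR color per facial-feature group (illustrative)
        "jaw": (255, 0, 0),
        "eyebrows": (0, 255, 0),
        "nose": (0, 0, 255),
        "eyes": (255, 255, 0),
        "mouth": (0, 255, 255),
    }
    canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for group, points in landmarks.items():
        pts = np.round(points).astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(canvas, [pts], isClosed=False,
                      color=colors.get(group, (255, 255, 255)), thickness=2)
    return canvas
```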

Q: We propose adversarial prediction of landmarks, as in the Fader Networks paper (add a discriminator to enforce embedding invariance w.r.t. pose and facial expression); see the sketch below. Is this a good idea at all?

A: Wouldn't it then be the opposite task?
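
For clarity, a hedged sketch of what we had in mind with the Fader-Networks-style idea (names and dimensions are hypothetical): an adversary tries to regress the landmarks from the embedding, while the embedder is trained to make that impossible.

```python
import torch
import torch.nn as nn

class LandmarkAdversary(nn.Module):
    """Tries to regress the landmark/pose code from the identity embedding."""
    def __init__(self, embed_dim, landmark_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, landmark_dim),
        )

    def forward(self, embedding):
        return self.net(embedding)

def invariance_losses(adversary, embedding, landmarks, criterion=nn.MSELoss()):
    """Two-player losses: the adversary minimizes adv_loss (predict landmarks
    from the embedding); the embedder minimizes emb_loss, i.e. maximizes the
    adversary's error, so the embedding becomes invariant to pose/expression."""
    adv_loss = criterion(adversary(embedding.detach()), landmarks)
    emb_loss = -criterion(adversary(embedding), landmarks)
    return adv_loss, emb_loss
```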

Q: Can we legally use our implementation afterwards? If not, why not?

A: Which law? Why not?

StijnVerdenius commented 5 years ago

Dear Minh, @Ignotus

I would like some clarification on a few of your answers.

First, in question 3 you replied that U-net is a later step, and then in question 5 you say to build the pix2pix pipeline first. But pix2pix uses a U-net generator, so that leaves me a bit confused.

Second, regarding question 6: we understand the idea of extracting the features and averaging them, just not how we practically inject the embedding into the generator, since it seems to be injected at multiple locations (i.e. not as a plain network input) and we will need to implement it after all.

Third, regarding question 7: what does that mean practically? Do we recolour the pictures that go into the embedder? Do we add three channels to the input feature map? Or something else?

Fourth, for the final question we meant to ask: "Are we allowed to personally use our implementation after the course/project is finished? Do we retain the rights?"

Finally, at what location will we meet today?