JiahuiYu / generative_inpainting

DeepFill v1/v2 with Contextual Attention and Gated Convolution, CVPR 2018, and ICCV 2019 Oral
http://jiahuiyu.com/deepfill/

deepfillv2 Questions #284

Closed FloCF closed 5 years ago

FloCF commented 5 years ago

Awesome work and thanks @JiahuiYu for being super responsive and helpful with all the issues here.

I tried to check various closed issues, but wasn't able to go through all of them, so please excuse me in advance if some of my questions have already been answered. Q1: You wrote in the deepfillv2 paper that

To maintain the same efficiency with our baseline model, we slim the base model width by 25%

So does that mean you cut out 25% of the layers in the coarse and refine networks? If so, which layers?

Q2: This has been partly discussed in #182, I think. Do you also include the ones channel in deepfillv2, so that the G input is cat[img, mask, sketch, ones] (6 dims) and the D input is cat[img, mask, sketch] (5 dims)?

Q3: As far as I understood, you use an L1 loss on the coarse and refined images plus the SN-PatchGAN loss on the refined image only. If so, how are these 3 losses added up? You mentioned a 1:1 relation; does that mean the final loss is L = L1_coarse + L1_refine + SN-Patch_refine?

Q4: I am still quite confused about the implementation of the contextual attention layer, so I am not sure if this question is actually valid. Why are you extracting 4x4 patches from the background for raw_w, which later get convolved with 3x3 patches from the foreground? Why 4x4 and not 3x3?

FloCF commented 5 years ago

Ah, about Q1: I guess 'width' means the number of channels, right? If so, Q1 is solved...

JiahuiYu commented 5 years ago

@FloCF Hi,

Q1: Yes, the width means the number of channels across all layers.

Q2: There is still a ones mask concatenated into the input. But I guess the performance without the ones channel should be similar, because we already use gated convolution.
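
For readers following along, here is a minimal sketch of how those inputs could be assembled (illustrative PyTorch, not the repo's TensorFlow code; the channel order and names are assumptions):

```python
import torch

def build_inputs(image, mask, sketch):
    # image: (B, 3, H, W); mask, sketch: (B, 1, H, W); mask == 1 inside the hole
    ones = torch.ones_like(mask)
    g_in = torch.cat([image, mask, sketch, ones], dim=1)  # 6 channels for G
    d_in = torch.cat([image, mask, sketch], dim=1)        # 5 channels for D
    return g_in, d_in
```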

Q3: Yes.
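
For illustration, the 1:1 combination confirmed here could be written roughly as below (a hedged PyTorch-style sketch, assuming the hinge form of the generator GAN term; the names are made up and this is not the repo's code):

```python
import torch.nn.functional as F

def generator_loss(x_coarse, x_refined, x_gt, d_fake):
    # L1 on both stages plus the SN-PatchGAN term on the refined output,
    # all summed with weight 1, as confirmed above.
    l1_coarse = F.l1_loss(x_coarse, x_gt)
    l1_refine = F.l1_loss(x_refined, x_gt)
    gan_refine = -d_fake.mean()  # hinge generator loss on the refined image
    return l1_coarse + l1_refine + gan_refine
```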

Q4: We extract 4x4 patches because these patches will be used as conv_transpose kernels (people usually use a 4x4 kernel size for transposed convolution). The 3x3 kernel size is used for a similar reason as in regular convolutions. You can do an ablation study to see whether different kernel sizes make a difference (I am not sure).
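
To make the kernel-size point concrete, here is a rough, simplified sketch of that attention step (PyTorch, no mask handling or score fusing; names such as fg, bg, raw_bg are assumptions, and this is not the repo's TensorFlow implementation):

```python
import torch
import torch.nn.functional as F

def contextual_attention_sketch(fg, bg, raw_bg, softmax_scale=10.0):
    # fg, bg: downscaled foreground/background features, shape (B, C, H, W)
    # raw_bg: background features at 2x resolution, used for reconstruction
    B, C, H, W = bg.shape

    # 3x3 background patches act as conv kernels for cosine-similarity
    # matching against the foreground (same reason 3x3 is common for convs).
    w = F.unfold(bg, kernel_size=3, padding=1)              # (B, C*9, H*W)
    w = w.transpose(1, 2).reshape(B, H * W, C, 3, 3)
    w = w / w.flatten(2).norm(dim=2).clamp_min(1e-4)[:, :, None, None, None]

    # 4x4 background patches (raw_w) act as conv_transpose kernels, hence
    # the 4x4 size: it matches the usual transposed-convolution kernel.
    raw_w = F.unfold(raw_bg, kernel_size=4, stride=2, padding=1)
    raw_w = raw_w.transpose(1, 2).reshape(B, H * W, raw_bg.shape[1], 4, 4)

    out = []
    for i in range(B):
        score = F.conv2d(fg[i:i + 1], w[i], padding=1)      # (1, H*W, H, W)
        attn = F.softmax(score * softmax_scale, dim=1)       # attention over bg patches
        # Paste the 4x4 raw patches back, weighted by the attention map.
        out.append(F.conv_transpose2d(attn, raw_w[i], stride=2, padding=1) / 4.0)
    return torch.cat(out, dim=0)
```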

I am closing this issue, but you can still leave questions here if you have more. :)

FloCF commented 5 years ago

Hi @JiahuiYu, thanks for your quick and accurate reply as always. Thanks to that, I was able to get my program running. So my - probably final - questions are about training details.

Q5: I noticed that during training the discriminator loss is very often exactly 0, probably because D(G(z)) << 0 and the hinge loss saturates. Did you also observe this during training? (I am currently at epoch 35 and it has been that way for the past 20 epochs, so the generator has not made much progress since then.)
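
For context, the SN-PatchGAN hinge losses from the paper look like this (a minimal sketch); the discriminator term is exactly zero once D(real) >= 1 and D(fake) <= -1 at every location, which is one way a constant-zero discriminator loss can arise:

```python
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Zero whenever d_real >= 1 and d_fake <= -1 at every spatial location.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    return -d_fake.mean()
```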

Q6: For your losses, where did you use the full generated image and where the pasted one (masked generated image + real background), in the generator and discriminator losses?

Q7: Any recommendations for training hyperparameters, like learning rate, Adam betas, and batch size?

Many thanks in advance and keep up the awesome work!

JiahuiYu commented 5 years ago

@FloCF Hi, thanks for your interest first. And you can ask questions here whenever you want.

Q5: If you are training on Places2, the network should not converge at epoch 20. I have not observed any signal like yours so far.

Q6: It is the same as DeepFill v1: for the pixel-wise loss we use the full generated image; for the GAN loss we use the pasted one.

Q7: Keep the same as in DeepFill v1.

fzhang612 commented 5 years ago

May I ask a follow-up question on Q6?

What is the reason behind using different images for different losses, i.e. the fully generated image for the pixel-wise loss and the pasted one for the GAN loss?

Thanks for this great work, by the way.

JiahuiYu commented 5 years ago

@fzhang612

For the pixel-wise loss, using the full generated image encourages color consistency. For the GAN loss, using the pasted one helps reduce boundary artifacts overall.
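
A tiny sketch of the two image variants being contrasted here (assuming the convention mask == 1 inside the hole; illustrative only, not the repo's code):

```python
import torch.nn.functional as F

def image_terms(x_real, x_gen, mask):
    # Pixel-wise loss sees the full generator output -> color consistency.
    l1 = F.l1_loss(x_gen, x_real)
    # GAN loss sees the pasted result: generated pixels inside the hole,
    # real pixels outside, so the critic focuses on the hole and its boundary.
    x_pasted = x_gen * mask + x_real * (1.0 - mask)
    return l1, x_pasted
```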

htn274 commented 4 years ago

In the paper, you said that

six strided convolutions with kernel size 5 and stride 2 is stacked to captures the feature statistics of Markovian patches.

So what does "the feature statistics of Markovian patches" mean? I also read the Markovian GAN paper which you referred to, but I am still confused.

I would be very grateful if anyone could answer me.
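
For reference, the quoted sentence describes a discriminator stack roughly like the following (a hedged PyTorch sketch; channel widths are illustrative and the 5-channel input follows the earlier discussion in this thread). Each position of the final feature map scores one local patch of the input, which is the "Markovian patch" view inherited from Markovian GANs / PatchGAN.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_patchgan_discriminator(cin=5, cnum=64):
    # Six spectrally-normalized strided convolutions, kernel 5, stride 2,
    # as described in the quote; channel widths here are only illustrative.
    channels = [cin, cnum, cnum * 2, cnum * 4, cnum * 4, cnum * 4, cnum * 4]
    layers = []
    for i in range(6):
        layers.append(spectral_norm(nn.Conv2d(channels[i], channels[i + 1],
                                              kernel_size=5, stride=2, padding=2)))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
    # Output is a 3-D feature map; each spatial position has a local
    # receptive field, so the GAN loss is effectively applied per patch.
    return nn.Sequential(*layers)
```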