First, the encoder. Let me make sure I understand the idea:
Image I -> SE -> R_se
Image I -> BE -> R_be
A linear embedding (L_embed, between SE and BE) is used to train the distillation. In the case of training the encoders, the decoder is left out of the computation.
Second, the decoder. I didn't see it in the paper, but you have trained a small decoder in the code. Could you give a very high-level idea of training the decoder? Is it necessary to train a small decoder, or does the original decoder with a small encoder still give sufficiently good results?
Thanks for your interest in our work!
(1) The procedure for WCT should be: Image I -> SE -> BD -> R_se. For R_se (to make sure I understand your notation, this stands for the reconstructed image in the SE pipeline), there are two losses: pixel reconstruction and perceptual loss (basically, this part is the same as in WCT). There is no need to get R_be because it is not involved in any loss calculation (although I did cache it in the code, simply for logging).
In the case of training the encoders, the decoder is fixed, yet not left out of the computation: it is still in the pipeline in order to obtain the pixel and perceptual losses.
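To make this first stage concrete, here is a minimal PyTorch-style sketch of one SE training step, assuming hypothetical handles `SE`, `BE`, `BD`, and `embed` (the linear embedding); how exactly the embedded SE features are fed into the frozen BD, and the loss weights, are assumptions and may differ from the actual repo.

```python
import torch
import torch.nn.functional as F

def train_se_step(I, SE, BE, BD, embed, optimizer,
                  w_pixel=1.0, w_percep=1.0, w_embed=1.0):
    """One SE training step for the WCT reconstruction stage.
    BE and BD are frozen (requires_grad=False); only SE and embed are optimized."""
    optimizer.zero_grad()

    feat_se = SE(I)                       # student features
    with torch.no_grad():
        feat_be = BE(I)                   # teacher features, used as a fixed target

    # Decode through the frozen big decoder; feeding embed(feat_se) so the
    # channel count matches what BD expects is an assumption.
    R_se = BD(embed(feat_se))

    loss_pixel = F.mse_loss(R_se, I)                  # pixel reconstruction
    loss_percep = F.mse_loss(BE(R_se), feat_be)       # perceptual loss (BE as loss network)
    loss_embed = F.mse_loss(embed(feat_se), feat_be)  # linear-embedding distillation

    loss = w_pixel * loss_pixel + w_percep * loss_percep + w_embed * loss_embed
    loss.backward()
    optimizer.step()
    return loss.item()
```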
(2) In our paper, we mention in Sec. 4.1: "Their mirrored decoders are trained by the same rule as the first step using the loss (1)". The keyword here is "mirrored": since the encoder is now small, the mirrored decoder is also a small one instead of the original one. This arises from a practical aim (which is also the general motive of this paper): we want to achieve large input resolutions in NST. Of course, the original decoder with a small encoder (i.e., "SE+BD") will give sufficiently good results (after all, the BD has a larger capacity than the SD). But "SE+BD" does not really eliminate the resolution problem, since in the BD part the feature maps are still too large for a 12GB GPU. This is why we train a small decoder. There is no big "high-level idea" here, simply practical motives :-).
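For illustration only, one way to read "mirrored" is: reverse the encoder's layer order and swap in/out channels, so a slimmer encoder automatically yields a slimmer decoder. The channel widths below are placeholders, not the paper's actual configuration.

```python
import torch.nn as nn

def mirror_decoder(encoder_channels):
    """Build a decoder that mirrors an encoder described by its channel widths,
    e.g. [3, 16, 32, 64] for a small 3-stage encoder (placeholder numbers)."""
    chans = list(reversed(encoder_channels))          # e.g. [64, 32, 16, 3]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers.append(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1))
        if c_out != chans[-1]:                        # no ReLU/upsample after the output conv
            layers.append(nn.ReLU(inplace=True))
            layers.append(nn.Upsample(scale_factor=2, mode='nearest'))
    return nn.Sequential(*layers)
```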
Thanks for the reply, it clears up most of my questions, but let me repeat:
I -> SE -> BD -> R_se
I -> BE -> BD -> R_bd
where BE and BD are frozen, and only SE is supposed to be trained. Will the process be the same for AdaIN? Instead of reconstructing the original image, would we be comparing the stylized images R_se and R_bd?
How does L1 pruning come into play? Is L1 pruning used to select channels for the SE and SD?
You are welcome. No, we are not comparing the stylized image R_se with R_bd. As shown in Fig. 3 of the paper, the losses for AdaIN are the content and style losses, but the general spirit is the same: first think about what the original pipeline is; then, for model compression, simply replace the BE with the SE, keep all the original losses, and add one more loss (the linear embedding loss, to guide the intermediate layers of the SE).
What you just suggested makes sense: since we want the student to be close to the teacher, comparing R_se with R_bd is a natural idea. Yet we generally think the supervision from the content and style losses is stronger and more proper. By "more proper" I mean that, for each content/style pair, there is more than one pleasing stylized result. In our case, we only expect the stylized result from the SE (R_se) to be "visually comparable" with the original one (R_bd). They do not have to be the same (or close) at the pixel level. That is why we do not compare R_se with R_bd (and of course, you can try adding this loss; I loosely think it won't change the results much).
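A rough sketch of how the AdaIN case might look under this recipe, with the BE replaced by the SE in the stylization path and the original content/style losses kept. `SE`, `BE`, `BD`, `embed`, the loss weights, and the use of a single feature layer are all simplifying assumptions (the real AdaIN setup uses several VGG layers for the style loss).

```python
import torch
import torch.nn.functional as F

def adain(c_feat, s_feat, eps=1e-5):
    """Standard AdaIN: align channel-wise mean/std of content features to style."""
    c_mean, c_std = c_feat.mean((2, 3), keepdim=True), c_feat.std((2, 3), keepdim=True) + eps
    s_mean, s_std = s_feat.mean((2, 3), keepdim=True), s_feat.std((2, 3), keepdim=True) + eps
    return s_std * (c_feat - c_mean) / c_std + s_mean

def adain_se_losses(Ic, Is, SE, BE, BD, embed, w_style=10.0, w_embed=1.0):
    # Stylization path uses the small encoder (student); feeding embed(t) to BD
    # is an assumption about how the channel counts are matched.
    fc_s, fs_s = SE(Ic), SE(Is)
    t = adain(fc_s, fs_s)
    stylized = BD(embed(t))

    # Original AdaIN-style losses, computed with the frozen teacher encoder BE
    # as the loss network (simplified here to one layer).
    fc_t, fs_t = BE(Ic).detach(), BE(Is).detach()
    f_out = BE(stylized)
    loss_content = F.mse_loss(f_out, adain(fc_t, fs_t))
    loss_style = F.mse_loss(f_out.mean((2, 3)), fs_t.mean((2, 3))) + \
                 F.mse_loss(f_out.std((2, 3)), fs_t.std((2, 3)))

    # Linear-embedding distillation loss on the SE's content features
    loss_embed = F.mse_loss(embed(fc_s), fc_t)
    return loss_content + w_style * loss_style + w_embed * loss_embed
```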
For L1 pruning, it is used for weight initialization of the SE/SD (see Sec. 4.1). In this sense, yes, it is used to select channels from the BE/BD as a base model for the SE/SD.
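As a sketch of how L1-norm filter selection could initialize a small conv layer from the corresponding big one (the exact selection and copying rules in the repo may differ):

```python
import torch
import torch.nn as nn

def l1_select_filters(big_conv, num_keep, in_keep_idx=None):
    """Keep the num_keep output filters of big_conv with the largest L1 norms
    and copy them into a smaller conv layer (sketch of L1 filter pruning)."""
    W = big_conv.weight.data                          # shape: (out, in, kH, kW)
    l1 = W.abs().sum(dim=(1, 2, 3))                   # L1 norm per output filter
    keep_idx = torch.argsort(l1, descending=True)[:num_keep].sort().values

    W_kept = W[keep_idx]
    if in_keep_idx is not None:                       # also prune input channels to
        W_kept = W_kept[:, in_keep_idx]               # match the previous pruned layer

    small_conv = nn.Conv2d(W_kept.shape[1], num_keep, kernel_size=W.shape[2:],
                           stride=big_conv.stride, padding=big_conv.padding)
    small_conv.weight.data.copy_(W_kept)
    if big_conv.bias is not None:
        small_conv.bias.data.copy_(big_conv.bias.data[keep_idx])
    return small_conv, keep_idx   # keep_idx becomes in_keep_idx for the next layer
```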