cuiaiyu / dressing-in-order

(ICCV'21) Official code of "Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing" by Aiyu Cui, Daniel McKee and Svetlana Lazebnik
https://cuiaiyu.github.io/dressing-in-order

Questions and thoughts on model size and performance #43

Closed jackylu0124 closed 1 year ago

jackylu0124 commented 2 years ago

First of all, thank you for this incredible project!

I would like to hear some of your insights on the trade-off between model efficiency and output quality, especially with regard to DiOr. One thing I observed while profiling the DiOr pipeline is that some of the models are relatively large and require a relatively large number of FLOPs during inference, which makes the pipeline unsuitable for some application scenarios. For example, among the largest are the generator network (~16.5 million parameters) at the end of the pipeline, the flow network (~6.6 million parameters), and the segment encoder network (~1.2 million parameters).

I dug into the aforementioned models’ architectures a bit, and at first glance I found that convolutions with large kernel sizes (5 or 7) are used in quite a few places, such as inside the ContentEncoder and Decoder in the generator, and the ADGANEncoder component in the segment encoder network. Do you think it would be a good idea to replace these large-kernel convolution layers with a series/stack of smaller ones in order to boost model performance and quality? For example, a stack of three 3x3 convolution layers with stride 1 has the same receptive field as one 7x7 convolution layer, and not only does it have fewer parameters (3*(3^2*C^2) vs. 7^2*C^2), it also gives the model more non-linearities due to the greater depth.
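To make the comparison concrete, here is a quick PyTorch parameter count for the two options. The channel width C=64 is a hypothetical example, not a value taken from DiOr's actual configs:

```python
import torch.nn as nn

def conv_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

C = 64  # example channel width; DiOr's actual widths may differ

# One 7x7 convolution (receptive field 7).
conv7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

# Three stacked 3x3 convolutions (also receptive field 7, plus two extra
# non-linearities in between).
stack3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

n7 = conv_params(conv7)   # 7^2 * C^2 = 200704
n3 = conv_params(stack3)  # 3 * 3^2 * C^2 = 110592
```

So at equal channel width, the stack uses roughly 55% of the parameters of the single 7x7 layer.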

Another idea I have been toying around with is the possibility of using separable convolutions, like the ones used in MobileNets, in order to reduce model size and latency. That said, I don’t know whether the separable-convolution strategy would have a noticeable impact on the quality of the results produced by the DiOr models. I would really love to hear your opinions and thoughts on this.
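For reference, a minimal sketch of the MobileNets-style depthwise separable convolution (the channel counts are illustrative, not DiOr's):

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution as in MobileNets: a per-channel
    (depthwise) conv followed by a 1x1 (pointwise) conv that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3 conv with the same shapes.
std = nn.Conv2d(64, 64, 3, padding=1, bias=False)
sep = SeparableConv2d(64, 64)
n_std = sum(p.numel() for p in std.parameters())  # 3*3*64*64 = 36864
n_sep = sum(p.numel() for p in sep.parameters())  # 3*3*64 + 64*64 = 4672
```

The roughly 8x parameter reduction is where the latency savings come from, though whether it costs visual quality here is exactly the open question.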

Thanks a lot for the great work again!

cuiaiyu commented 2 years ago

Hi! This is a very interesting topic and thank you for the thoughtful question!

First, for the convolution specs (e.g., the [7,1,3] and [4,2,1] kernel/stride/padding) in the encoders and decoder, we directly inherited those hyperparameters from the prior work ADGAN for a fair comparison, so we have never tuned them. Therefore, they are not necessarily (and very likely not) the optimal choices for the convolution layers.

From my intuition and limited experiments with those kernel sizes, replacing them with stacked 3x3 convolution kernels would not change the performance much, if at all, because the kernel size in the encoders/decoder is not the crucial part of how DiOr gets its performance. Changing the channel size (--ngf) or the ResNet blocks in the decoder would have a more direct impact on performance.
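As a rough sketch of why the channel width matters so much: the parameter count of a conv-based residual block grows roughly quadratically with the width, so halving --ngf cuts such a block's parameters by about 4x. A toy illustration (this is not DiOr's actual block definition):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

def toy_res_block(ngf):
    """A minimal residual-style body of two 3x3 convs at width ngf.
    Parameter count is 2 * 3*3 * ngf^2, i.e. quadratic in ngf."""
    return nn.Sequential(
        nn.Conv2d(ngf, ngf, 3, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.Conv2d(ngf, ngf, 3, padding=1, bias=False),
    )

wide = n_params(toy_res_block(64))    # 2 * 9 * 64^2 = 73728
narrow = n_params(toy_res_block(32))  # 2 * 9 * 32^2 = 18432, ~4x fewer
```

By contrast, shrinking a single 7x7 kernel to stacked 3x3s only shaves a constant factor off one layer, which is why width and depth are the more direct levers.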

As for MobileNets-like convolutions, I have personally never played around with them before, so it's hard to predict their behavior without running experiments. However, I think this would be a very meaningful follow-up, and I would like to know if it could make the model more computationally efficient as well! :)

jackylu0124 commented 2 years ago

Thank you very much for your reply and the valuable insights! You brought up a very good point: the convolution layers with large kernel sizes aren’t the ones that make up the bulk of the encoder and decoder, and modifying the channel size or the ResNet blocks might lead to a more noticeable improvement in inference performance.

In terms of using strategies from MobileNets in GANs, I investigated a bit online and found that there is certainly more literature that focuses on techniques for improving performance in the traditional CV deep learning space (detection, segmentation, etc.) than in the relatively nascent “generative” deep learning space. Perhaps it’s because GANs are still famous for their power and the wow factor of generating entirely original photorealistic images, and that’s likely the key aspect many have been working on. It would be very interesting, as you said, to explore the possibilities of integrating some of the techniques from the “traditional” CV deep learning space into GANs. There are now more and more practical application scenarios for GANs in real life, and the demand for high-quality yet fast results has never been greater.

Again, thank you very much for your insights and I cannot wait to see what you come up with next!

imr555 commented 2 years ago

@jackylu0124, if you are interested in efficient depthwise-separable-convolution-based image generation in the GAN space, you might find this work quite interesting. It provides comparisons against strong prior GAN architectures:

Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis https://arxiv.org/abs/2101.04775

It was published at ICLR'21, and the well-known lucidrains created an awesome working repository based on the paper's conclusions: https://github.com/lucidrains/lightweight-gan

cuiaiyu commented 2 years ago

By the way, this GAN Compression paper introduces a distillation method that produces an "equivalently" performing but much more efficient copy of the network for inference, which might also be helpful for reducing FLOPs.

https://arxiv.org/pdf/2003.08936.pdf
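As a rough illustration of the output-matching part of such distillation (the actual GAN Compression pipeline also performs channel-number search and trains with GAN losses; the tiny teacher/student networks below are hypothetical stand-ins, not anyone's real architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical large "teacher" generator and small "student" generator.
teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))
student = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1))

x = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    target = teacher(x)  # frozen teacher output serves as the target

# The student is trained to reproduce the teacher's outputs.
loss = F.l1_loss(student(x), target)
loss.backward()
```

After training, only the small student is deployed, which is where the FLOPs savings at inference time come from.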

jackylu0124 commented 2 years ago

Hi Ifty and Aiyu,

Thank you very much for sharing the papers! I had some more time today and read them in detail, and they are indeed very intriguing! It's interesting that the two papers choose quite different routes for tackling the efficiency issues commonly found in GANs.

Based on my preliminary impressions of these two papers thus far, Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis seems to focus more on "human-driven" architectural design changes, such as the introduction of the Skip-Layer channel-wise Excitation (SLE) module, with an emphasis on unconditional GANs. On the other hand, GAN Compression: Efficient Architectures for Interactive Conditional GANs focuses on using distillation techniques as well as "machine-oriented" architecture search for improving the model's inference performance.
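For anyone skimming this thread, here is a minimal sketch of the SLE idea as I understand it, where a low-resolution feature map gates the channels of a high-resolution one (simplified from the paper; the layer sizes are illustrative, and I've used ReLU where the paper uses LeakyReLU):

```python
import torch
import torch.nn as nn

class SkipLayerExcitation(nn.Module):
    """Sketch of Skip-Layer channel-wise Excitation (SLE): pool a
    low-resolution feature map down, squeeze it to per-channel gates in
    [0, 1], and scale the high-resolution feature map channel-wise."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Conv2d(low_ch, high_ch, 4, bias=False),  # 4x4 -> 1x1 spatial
            nn.ReLU(inplace=True),
            nn.Conv2d(high_ch, high_ch, 1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x_low, x_high):
        # gate has shape (N, high_ch, 1, 1) and broadcasts over H x W.
        return x_high * self.gate(x_low)

sle = SkipLayerExcitation(low_ch=256, high_ch=64)
out = sle(torch.randn(1, 256, 8, 8), torch.randn(1, 64, 128, 128))
```

The appeal is that the skip connection is channel-wise rather than spatial, so it stays cheap even between feature maps of very different resolutions.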

Both of them are very rich in technical details and ideas, and I definitely need to spend some more time to digest them. Thank you so much again for sharing these interesting works with me!