NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tsang | Review -- BiGAN: Adversarial Feature Learning (GAN). #53

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tsang. Review — BiGAN: Adversarial Feature Learning (GAN).

NorbertZheng commented 1 year ago

Overview

Bidirectional Generative Adversarial Networks (BiGANs): Learning the Inverse Mapping from Image Space to Latent Space.

In this story, Adversarial Feature Learning (BiGAN), by the University of California, Berkeley, and the University of Texas at Austin, is briefly reviewed.

This is a 2017 ICLR paper with over 1100 citations.

The idea is the same as that of ALI; the two were proposed independently and published at the same conference (2017 ICLR). Some papers cite BiGAN and ALI together when discussing this idea.

NorbertZheng commented 1 year ago

BiGAN: Overall Structure

image BiGAN: Overall Structure.

$$ \min_{G,E}\max_{D}V(D,E,G) $$

$$ V(D,E,G):=\mathbb{E}_{x \sim p(x)}\Big[\underbrace{\mathbb{E}_{z \sim p_{E}(\cdot|x)}[\log D(x,z)]}_{\log D(x,E(x))}\Big]+\mathbb{E}_{z \sim p(z)}\Big[\underbrace{\mathbb{E}_{x \sim p_{G}(\cdot|z)}[\log(1-D(x,z))]}_{\log(1-D(G(z),z))}\Big]. $$

A model trained to predict features $z$ given data $x$ should learn useful semantic representations. The BiGAN objective forces the encoder $E$ to do exactly this.

In order to fool the discriminator at a particular $z$, the encoder must invert the generator at that $z$, such that $E(G(z))=z$, which is exactly what the $L_{g}$ term in TEM does!
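
A minimal PyTorch sketch of this objective (the architectures, sizes, and names here are placeholders, not the paper's convnets; the discriminator sees joint pairs, $(x,E(x))$ as "real" and $(G(z),z)$ as "fake"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder MLP components for a flattened-image BiGAN.
z_dim, x_dim = 50, 784
E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))      # encoder E(x)
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))      # generator G(z)
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))  # joint discriminator D(x, z)

def bigan_losses(x_real):
    """One evaluation of V(D, E, G): D scores (x, E(x)) vs. (G(z), z)."""
    z_fake = torch.randn(x_real.size(0), z_dim)     # z ~ p(z)
    z_real = E(x_real)                              # E(x), deterministic encoder
    x_fake = G(z_fake)                              # G(z)
    d_real = D(torch.cat([x_real, z_real], dim=1))  # logit for D(x, E(x))
    d_fake = D(torch.cat([x_fake, z_fake], dim=1))  # logit for D(G(z), z)
    # D maximizes V: push D(x, E(x)) -> 1 and D(G(z), z) -> 0.
    # (In a real loop, detach z_real / x_fake for the D update.)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # E and G minimize V (non-saturating form: flip the targets).
    loss_eg = F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real)) + \
              F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return loss_d, loss_eg
```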

NorbertZheng commented 1 year ago

Experimental Results

Permutation-Invariant MNIST

image One Nearest Neighbors (1NN) classification accuracy (%) on the permutation-invariant MNIST test set in the feature space.

All methods, including BiGAN, perform at roughly the same level. This result is not overly surprising given the relative simplicity of MNIST digits.

image Qualitative results for permutation-invariant MNIST BiGAN training, including generator samples $G(z)$, real data $x$, and corresponding reconstructions $G(E(x))$.

Digits generated by the generator $G$ nearly perfectly match the data distribution (qualitatively, e.g. at the pixel level), as shown above.
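
A minimal sketch of the 1NN evaluation protocol above, assuming a trained encoder `E` (as in the earlier sketch) and scikit-learn; the data variables are placeholders:

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

# Placeholders: x_train/x_test are flattened MNIST image tensors,
# y_train/y_test their numpy label arrays, E the trained BiGAN encoder.
with torch.no_grad():
    feat_train = E(x_train).numpy()  # embed data into the learned feature space
    feat_test = E(x_test).numpy()

knn = KNeighborsClassifier(n_neighbors=1)  # 1NN classifier in feature space
knn.fit(feat_train, y_train)
print("1NN accuracy:", knn.score(feat_test, y_test))
```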

NorbertZheng commented 1 year ago

ImageNet

image Qualitative results for ImageNet BiGAN training, including generator samples $G(z)$, real data $x$, and corresponding reconstructions $G(E(x))$.

As shown above, the reconstructions, while certainly imperfect, demonstrate empirically that the BiGAN encoder $E$ and generator $G$ learn approximate inverse mappings.
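
One simple way to quantify how close $E$ and $G$ are to exact inverses is the reconstruction error; a minimal sketch, assuming the trained `E` and `G` from above and a placeholder batch `x`:

```python
import torch

# Placeholders: E and G are the trained encoder/generator,
# x a batch of images in the flattened shape E expects.
with torch.no_grad():
    x_rec = G(E(x))                            # reconstruction G(E(x))
    mse = torch.mean((x - x_rec) ** 2).item()  # distance from an exact inverse
print(f"mean squared reconstruction error: {mse:.4f}")
```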

image Classification accuracy (%) for the ImageNet LSVRC validation set.

BiGAN is competitive with these contemporary visual feature learning methods.
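
A common way such feature-learning comparisons are run is a linear probe on frozen features; a minimal sketch of that protocol (not necessarily the paper's exact setup), assuming a trained encoder `E` and a placeholder data `loader`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear probe: freeze the trained encoder E and fit only a linear
# classifier on its features. z_dim/num_classes are placeholders.
z_dim, num_classes = 50, 1000
probe = nn.Linear(z_dim, num_classes)
opt = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)

for x, y in loader:      # placeholder ImageNet-style DataLoader
    with torch.no_grad():
        feats = E(x)     # frozen features; no gradient reaches E
    loss = F.cross_entropy(probe(feats), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```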

NorbertZheng commented 1 year ago

PASCAL VOC

image Classification and Fast R-CNN detection results for the PASCAL VOC 2007 test set and FCN segmentation results on the PASCAL VOC 2012 validation set.

NorbertZheng commented 1 year ago

Reference

[2017 ICLR] [BiGAN] Adversarial Feature Learning.

NorbertZheng commented 1 year ago

Generative Adversarial Network (GAN)

Image Synthesis: [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN] [BiGAN]
Image-to-image Translation: [Pix2Pix] [UNIT]
Super Resolution: [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection: [DMENet]
Camera Tampering Detection: [Mantini’s VISAPP’19]
Video Coding: [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]