SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

Using custom input image shape #547

Closed pranjal-joshi-cc closed 1 year ago

pranjal-joshi-cc commented 2 years ago

Can we use a custom input image shape while training? I would like to set an input shape of (512, 512, 3), but anything other than (32, 32, 3) throws a mismatch error. Can you explain how to determine the encoder and decoder network parameters? Thanks!

mauicv commented 2 years ago

Hey @pranjal-joshi-cc, thanks for opening the issue. Can you be a bit more specific or share some code? Are you following an example notebook or a page in the docs? If you're looking at something like this, then I think it should be enough to change the InputLayer line in the model definition to:

encoder_net = tf.keras.Sequential([
    InputLayer(input_shape=(512, 512, 3)), # <-- CHANGE THE SHAPE HERE
    Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
    ...
    Dense(encoding_dim,)
])

and update any other references to that shape throughout. It's hard to help without more details, though.

pranjal-joshi-cc commented 2 years ago

Hi @mauicv, and thanks for the quick reply. I am using the VAE outlier detection method and I've tried changing the input_shape as suggested, but the following code throws an error:

RESOLUTION = 512
IMAGE_SIZE = (RESOLUTION, RESOLUTION)
IMAGE_SHAPE = (RESOLUTION, RESOLUTION, 3)
...
...

latent_dim = 1024

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=IMAGE_SHAPE),
      Conv2D(32, 3, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 3, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(512, 3, strides=2, padding='same', activation=tf.nn.relu)
  ])

decoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(latent_dim,)),
      Dense(4*4*128),
      Reshape(target_shape=(4, 4, 128)),
      Conv2DTranspose(256, 3, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(64, 3, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(3, 3, strides=2, padding='same', activation='sigmoid')
  ])

od = OutlierVAE(threshold=.015,  # threshold for outlier score
                score_type='mse',  # use MSE of reconstruction error for outlier detection
                encoder_net=encoder_net,  # can also pass VAE model instead
                decoder_net=decoder_net,  # of separate encoder and decoder
                latent_dim=latent_dim,
                samples=2)

od.fit(x_train,
       loss_fn=elbo,
       cov_elbo=dict(sim=.05),
       epochs=100,
       verbose=True)

The fit method throws the following error:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/Users/pranjaljoshi/Documents/CC Projects/Goodpack/Alibi/alibi_test.ipynb Cell 8' in <cell line: 2>()
      1 # train
----> 2 od.fit(x_train,
      3        loss_fn=elbo,
      4        cov_elbo=dict(sim=.05),
      5        epochs=100,
      6        verbose=True)
      8 # save the trained outlier detector
      9 save_detector(od, filepath)

File ~/miniforge3/envs/alibi/lib/python3.8/site-packages/alibi_detect/od/vae.py:133, in OutlierVAE.fit(self, X, loss_fn, optimizer, cov_elbo, epochs, batch_size, verbose, log_metric, callbacks)
    130     kwargs['loss_fn_kwargs'] = {cov_elbo_type: tf.dtypes.cast(cov, tf.float32)}
    132 # train
--> 133 trainer(*args, **kwargs)

File ~/miniforge3/envs/alibi/lib/python3.8/site-packages/alibi_detect/models/tensorflow/trainer.py:85, in trainer(model, loss_fn, x_train, y_train, dataset, optimizer, loss_fn_kwargs, preprocess_fn, epochs, reg_loss_fn, batch_size, buffer_size, verbose, log_metric, callbacks)
     83 if isinstance(loss_fn, Callable):  # type: ignore
     84     args = [y, y_hat] if tf.is_tensor(y_hat) else [y] + list(y_hat)
---> 85     loss = loss_fn(*args)
     86 else:
     87     loss = 0.

File ~/miniforge3/envs/alibi/lib/python3.8/site-packages/alibi_detect/models/tensorflow/losses.py:44, in elbo(y_true, y_pred, cov_full, cov_diag, sim)
...
   7105 def raise_from_not_ok_status(e, name):
   7106   e.message += (" name: " + name if name is not None else "")
-> 7107   raise core._status_to_exception(e) from None

InvalidArgumentError: Incompatible shapes: [15,786432] vs. [15,3072] [Op:Sub]

mauicv commented 2 years ago

Ah, sorry, I forgot you'll also need to ensure the decoder produces an output of the correct shape. Basically, the encoder maps (15, 512, 512, 3) into the latent space, but the decoder maps the latent space back to (15, 32, 32, 3), and this causes the shape mismatch in the loss function (note that 786432 = 512*512*3 and 3072 = 32*32*3, which is exactly the mismatch in the error above). You'll have to change the decoder architecture so it produces the correct output shape, and I'd also consider adding a few more convolutional and deconvolutional layers. Something like the following should work:

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=IMAGE_SHAPE),
      Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
  ])

decoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(latent_dim,)),
      Dense(8*8*1024),
      Reshape(target_shape=(8, 8, 1024)),
      Conv2DTranspose(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(3, 1, strides=1, padding='same', activation='sigmoid')
  ])
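
A quick check that may help here (untested sketch, just pushing dummy tensors through the nets defined above, assuming your IMAGE_SHAPE and latent_dim variables): verify that the decoder really reconstructs the input image shape before handing both networks to OutlierVAE.

import tensorflow as tf

# dummy image batch with the intended input shape, e.g. IMAGE_SHAPE = (512, 512, 3)
dummy_batch = tf.zeros((1, *IMAGE_SHAPE))
encoded = encoder_net(dummy_batch)
print("encoder output shape:", encoded.shape)   # e.g. (1, 8, 8, 1024)

# dummy latent vector with the chosen latent dimension
dummy_latent = tf.zeros((1, latent_dim))
decoded = decoder_net(dummy_latent)
print("decoder output shape:", decoded.shape)   # should be (1, 512, 512, 3)
assert tuple(decoded.shape[1:]) == IMAGE_SHAPE, "decoder output does not match the input image shape"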
pranjal-joshi-cc commented 1 year ago

@mauicv I've understood the encoder part: through the strides parameter we control the dimensionality reduction, and with encoder_net.summary() we can see the size of the last convolution output, i.e. N x N x Filters. However, is it necessary to always map the encoder output to 32 x 32 for alibi-detect to work, or is the choice of autoencoder architecture purely arbitrary?

Also, please explain how to calculate the size of the Dense layer and the reshape in the decoder net, as it's quite confusing for me.

decoder_net = tf.keras.Sequential(
  [
      ...
      Dense(8*8*1024),
      Reshape(target_shape=(8, 8, 1024)),
      ...
  ])

How do I determine the number of Dense units, i.e. 8*8*1024, and how do I determine the target shape in the following Reshape layer? @roshan-dadlaney

mauicv commented 1 year ago

Hey @pranjal-joshi-cc,

I've understood the encoder part: through the strides parameter we control the dimensionality reduction, and with encoder_net.summary() we can see the size of the last convolution output, i.e. N x N x Filters. However, is it necessary to always map the encoder output to 32 x 32 for alibi-detect to work, or is the choice of autoencoder architecture purely arbitrary?

I'm not completely sure what you mean here. The choice of autoencoder is arbitrary, except that:

  1. The architecture needs to be sufficient to model the data well. What I mean by this is that, when it's trained in the detector's fit method, it needs to reduce the reconstruction error well. This might not be possible if you don't have enough capacity in the network; as an example, if you don't choose a big enough latent dimension you might have difficulty. I don't think this should be an issue for the models defined above though.
  2. The VAE needs to produce output of the same shape as its input. For the purposes of OutlierVAE this really only constrains the decoder: it needs to map from the latent space of size latent_dim to the same shape as the original input image, so in your case (512, 512, 3).

As for the output shape of the encoder, it doesn't really matter as long as the capacity is sufficient, i.e. you don't reduce the dimensionality too much. For the architecture I provided above, for instance, we have:

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=IMAGE_SHAPE),
      Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
  ])

and the summary is:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 256, 256, 32)      1568      

 conv2d_1 (Conv2D)           (None, 128, 128, 64)      32832     

 conv2d_2 (Conv2D)           (None, 64, 64, 128)       131200    

 conv2d_3 (Conv2D)           (None, 32, 32, 256)       524544    

 conv2d_4 (Conv2D)           (None, 16, 16, 516)       2114052   

 conv2d_5 (Conv2D)           (None, 8, 8, 1024)        8455168   

=================================================================
Total params: 11,259,364
Trainable params: 11,259,364
Non-trainable params: 0
_________________________________________________________________

So the output shape of the encoder_net is (8, 8, 1024). Note that OutlierVAE adds some Dense layers to the encoder_net to transform the (8, 8, 1024) output into the latent space of dimension 1024, since you've chosen latent_dim=1024.
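
Just to illustrate what that mapping looks like (a conceptual sketch only, not the exact code the detector uses internally), the projection to the latent space is roughly:

import tensorflow as tf
from tensorflow.keras.layers import Flatten, Dense

# conceptual illustration: flatten the encoder output and project it to the latent space;
# in a VAE this is typically done with separate mean and log-variance heads
x = encoder_net(tf.zeros((1, 512, 512, 3)))   # (1, 8, 8, 1024)
h = Flatten()(x)                               # (1, 8*8*1024) = (1, 65536)
z_mean = Dense(latent_dim)(h)                  # (1, 1024)
z_log_var = Dense(latent_dim)(h)               # (1, 1024)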

Also, please explain how to calculate the size of the Dense layer and the reshape in the decoder net, as it's quite confusing for me. How do I determine the number of Dense units, i.e. 8*8*1024, and how do I determine the target shape in the following Reshape layer?

The decoder_net maps from the latent space of dimension 1024 (in our case) to the output shape (512, 512, 3), so it takes a vector of length latent_dim. We want to transform this into a shape that can then easily be scaled up to (512, 512, 3). You can do this a number of ways, but it's easiest if we set up the Conv2DTranspose operations to double the height and width at each layer of the network. The reason we choose 8*8*1024 is just that this can then be reshaped into (8, 8, 1024), which we can then upscale to the output image by applying each of the transpose layers. For instance, given the architecture I suggested above:

latent_dim = 1024

decoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(latent_dim,)),
      Dense(8*8*1024),
      Reshape(target_shape=(8, 8, 1024)),
      Conv2DTranspose(1024, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(516, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(256, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2DTranspose(3, 1, strides=1, padding='same', activation='sigmoid')
  ])

The latent vector of shape (1, 1024) is mapped to a vector of shape (1, 8*8*1024), which is then reshaped to (1, 8, 8, 1024) and upscaled by each of the transpose layers: (1, 8, 8, 1024) -> (1, 16, 16, 1024) -> (1, 32, 32, 516) -> (1, 64, 64, 256) -> (1, 128, 128, 128) -> (1, 256, 256, 64) -> (1, 512, 512, 32) -> (1, 512, 512, 3). So 8*8*1024 is really chosen as a convenience in order to reshape the tensor. Typically we choose image heights and widths to be powers of 2 just because it makes this scaling up and down simpler, but in general this doesn't have to be the case. The formula for the output size of a transpose convolution is documented here.
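
If it helps, here's a rough helper (my own sketch, following the usual Keras conventions) for sanity-checking how the spatial size doubles through the decoder:

# expected spatial size of a Conv2DTranspose output
def conv2d_transpose_out_size(in_size, kernel_size, stride, padding="same"):
    if padding == "same":
        return in_size * stride
    # standard formula for 'valid' padding (matches Keras when kernel_size >= stride)
    return (in_size - 1) * stride + kernel_size

size = 8
for _ in range(6):  # the six stride-2 transpose layers above
    size = conv2d_transpose_out_size(size, kernel_size=4, stride=2)
    print(size)     # 16, 32, 64, 128, 256, 512
# the final Conv2DTranspose(3, 1, strides=1) keeps the size at 512 and just sets 3 output channels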

ascillitoe commented 1 year ago

@pranjal-joshi-cc has @mauicv answered your question above? If so we shall close this issue 🙂

ascillitoe commented 1 year ago

Thanks for confirming @pranjal-joshi-cc!