cxy1997 / LISO

Learning Iterative Neural Optimizers for Image Steganography
https://arxiv.org/abs/2303.16206

Correct usage of model? #4

Open thomas-xin opened 2 months ago

thomas-xin commented 2 months ago

First of all, let me say that this is a really cool project!

I wanted to test inference on single files (to see whether it can be integrated into a couple of projects of mine). I was able to get an output that resembles both the cover and data inputs, but I think I'm doing something wrong: the output is very colour-distorted, and depending on the content of the image (I made sure to keep the image size consistent), the encoder.forward -> conv2d step sometimes raises the following error or a variant of it: RuntimeError: Given groups=1, weight of size [32, 33, 3, 3], expected input[1, 35, 512, 512] to have 33 channels, but got 35 channels instead

Here's the code I used to inference the model:

import numpy as np
from PIL import Image
import torch
import torchvision.transforms as transforms
import liso, liso.encoders, liso.decoders, liso.models

dtype = torch.float32
model = liso.models.LISO.load("checkpoints/div2k_jpeg/1_bits.steg")
model.encoder = model.encoder.to(dtype)
model.decoder = model.decoder.to(dtype)
if model.critic:
    model.critic = model.critic.to(dtype)
model.dtype = dtype
model.encoder.constraint = None

size = (512, 512)
cover_image = "cover.png"  # placeholder paths for the cover and payload images
data_image = "data.png"
im = Image.open(cover_image).resize(size, resample=Image.Resampling.LANCZOS)
da = Image.open(data_image).resize(size, resample=Image.Resampling.LANCZOS)
imt = transforms.ToTensor()(np.asanyarray(im)).unsqueeze(0).to(model.device).to(model.dtype)
dat = transforms.ToTensor()(np.asanyarray(da)).unsqueeze(0).to(model.device).to(model.dtype)

with torch.no_grad():
    resp = model.encoder(imt, dat)

im = transforms.ToPILImage()(resp[0][0].squeeze(0))
print(im)
im.save("test.png")

Let me know if I should be doing something different here. Thanks!

thomas-xin commented 2 months ago

Update: I have figured out how to run the model without errors: the input images need to have the same number of channels as the bits per pixel, and the images should be preprocessed with liso.loader.EVAL_TRANSFORM and postprocessed with liso.utils.to_np_img. That fixes the errors, but the colours still act quite strange.
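For reference, the fixed pipeline looks roughly like the sketch below. The names LISO.load, liso.loader.EVAL_TRANSFORM, and liso.utils.to_np_img come from the repository as used above; the random 1-channel bit payload and the output indexing are my own assumptions based on the 1_bits checkpoint and my earlier snippet.

import torch
from PIL import Image
import liso.models
from liso.loader import EVAL_TRANSFORM
from liso.utils import to_np_img

# Load the 1-bit-per-pixel checkpoint, as in the snippet above.
model = liso.models.LISO.load("checkpoints/div2k_jpeg/1_bits.steg")

# Preprocess the cover image with the repository's own eval transform
# (assumed to map a PIL image to a normalized CxHxW tensor).
cover = Image.open("cover.png").convert("RGB").resize((512, 512))
cover_t = EVAL_TRANSFORM(cover).unsqueeze(0).to(model.device)

# A 1-bit checkpoint expects a 1-channel payload; random bits are
# used here purely for illustration.
payload = torch.randint(0, 2, (1, 1, 512, 512), device=model.device).float()

with torch.no_grad():
    resp = model.encoder(cover_t, payload)

# Back to a displayable image (indexing follows the snippet above;
# to_np_img is assumed to return an HxWx3 uint8 array).
steg = to_np_img(resp[0][0].squeeze(0))
Image.fromarray(steg).save("steg.png")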

I also tried some of the sample eval arguments on the provided dataset and received the following error in the structural_similarity function: ValueError: win_size exceeds image extent. Either ensure that your images are at least 7x7; or pass win_size explicitly in the function call, with an odd value less than or equal to the smaller side of your images. If your images are multichannel (with color channels), set channel_axis to the axis number corresponding to the channels.

cxy1997 commented 2 months ago

Hi Thomas,

Thank you for your interest in our work. You can refer to this Colab notebook for running model inference. Please feel free to reach out if you have any further questions.

thomas-xin commented 1 month ago

Hi, thanks for the quick response! The provided code worked immediately, but I ended up thinking it still distorts the cover image too visibly and loses too much detail. I figured out the issue with the training and validation scripts (using channel_axis=2 in liso.utils.calc_ssim rather than multichannel=True), but after training a custom checkpoint with mse-weight=3, a larger dataset of 2779 images, and 30 epochs, the resulting model did not appear to behave any differently from the one in the example.
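(For anyone else hitting the same ValueError: scikit-image 0.19 replaced the multichannel flag with channel_axis, so the patched helper ends up looking something like this, assuming HxWxC uint8 inputs.)

import numpy as np
from skimage.metrics import structural_similarity

def calc_ssim(img1: np.ndarray, img2: np.ndarray) -> float:
    # Colour channels are on axis 2; multichannel=True was deprecated
    # and later removed in scikit-image >= 0.19.
    return structural_similarity(img1, img2, channel_axis=2)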

Are the model's size and architecture capable of achieving higher fidelity to the cover image? Apologies if these are silly questions; I'm still somewhat new to model training in general 😅

cxy1997 commented 1 month ago

Hi Thomas,

Thank you for your feedback. I appreciate your efforts in experimenting with the model and making adjustments.

Firstly, it's important to note that image steganography with JPEG compression presents additional challenges compared to PNG encoding. JPEG's lossy compression, which removes high-frequency components to reduce file size, inherently makes it more difficult to preserve the hidden message without visible distortions. For improved image quality, I would recommend using the LISO-PNG models (the default setting).

Method               Error (%) ↓    PSNR ↑
LISO-PNG             4E-5           33.83
LISO-PNG + L-BFGS    0.00           33.12
LISO-JPEG            6E-2           19.72

Evaluated on the div2k validation set with 1 bit encoded in each pixel.
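To make the JPEG difficulty concrete, here is a small self-contained illustration (not from the paper): a naive 1-bit-per-pixel LSB payload is almost completely destroyed by a single JPEG round trip.

import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
bits = rng.integers(0, 2, (64, 64), dtype=np.uint8)

# Embed one bit per pixel in the red channel's least significant bit.
stego = cover.copy()
stego[..., 0] = (stego[..., 0] & 0xFE) | bits

# Round-trip through lossy JPEG compression.
buf = io.BytesIO()
Image.fromarray(stego).save(buf, format="JPEG", quality=80)
buf.seek(0)
decoded = np.asarray(Image.open(buf))

# The LSBs come back essentially scrambled (~50% bit error rate).
recovered = decoded[..., 0] & 1
print("bit error rate:", (recovered != bits).mean())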

Regarding the results of your training: in our experiments we did not observe substantial performance gains from increased dataset size or number of training epochs. However, the trade-off between image quality and decoding accuracy can be controlled with the mse-weight parameter.
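Conceptually (my paraphrase, not the repository's actual training code), the objective is a weighted sum of a decoding term and an image-fidelity term, and mse-weight scales the latter:

import torch
import torch.nn.functional as F

# Stand-in tensors for the encoder/decoder outputs (illustrative only).
cover = torch.rand(1, 3, 64, 64)
stego = cover + 0.01 * torch.randn_like(cover)      # encoder output
logits = torch.randn(1, 1, 64, 64)                  # decoder output
payload = torch.randint(0, 2, (1, 1, 64, 64)).float()

mse_weight = 3.0  # the knob discussed above
decoding_loss = F.binary_cross_entropy_with_logits(logits, payload)
fidelity_loss = F.mse_loss(stego, cover)
loss = decoding_loss + mse_weight * fidelity_loss
# A larger mse_weight keeps stego closer to cover, at the cost of accuracy.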

Please feel free to reach out if you require further clarification.

thomas-xin commented 1 month ago

Ah, I see. My main interest was in the JPEG mode, since in my opinion that is the main thing setting this apart from purely deterministic steganography methods like low-bitplane substitution. I was mostly curious whether the model could learn to distribute the data more imperceptibly by taking advantage of existing image content, for example within high-noise areas or along edges and colour transitions. But after looking through the existing code, I suppose that also complicates the training process, because how an image is perceived by humans is not quite what MSE, PSNR, or other metrics represent.
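(For what it's worth, a crude proxy for those high-noise regions is a local-variance map; the sketch below, which is purely illustrative and not part of LISO, shows the kind of texture mask I had in mind.)

import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(gray: np.ndarray, size: int = 7) -> np.ndarray:
    # Var[x] = E[x^2] - (E[x])^2 over a size x size sliding window.
    mean = uniform_filter(gray, size)
    mean_sq = uniform_filter(gray ** 2, size)
    return mean_sq - mean ** 2

gray = np.random.rand(128, 128)       # stand-in for a grayscale image
mask = local_variance(gray)
mask = mask / (mask.max() + 1e-8)     # 1.0 = most textured, safest to perturb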

As usual, I appreciate the quick replies!