ProGamerGov / neural-style-pt

PyTorch implementation of neural style transfer algorithm
MIT License

Artifacts compared to Lua version #71

Open genekogan opened 4 years ago

genekogan commented 4 years ago

Coming back to this repo after a long break, interested in developing this further...

I was just comparing this implementation to the original Lua version and noticed something I hadn't before. The two produce almost identical results, but the PyTorch version appears to produce very subtle artifacts.

The following is an example, using Hokusai as the style image. On the left side is neural-style (lua), on the right side is neural-style-pt.

[comparison image: Mona Lisa with Hokusai style; neural-style (Lua) on the left, neural-style-pt on the right]

Notice in scattered places the presence of high-frequency discolorations, often in almost checkerboard-like patterns. These do not appear in the Lua version. If you zoom in on a few parts of the neural-style-pt output, you can see them clearly; note the pink and green checkers.

[zoomed-in crops of the neural-style-pt output showing the pink and green checkerboard artifacts]

This happens consistently for any combination of content and style images, although for some style images the artifacts are more obvious. Sometimes obvious discolorations appear; other times they are smaller, giving the output an almost grainy appearance. The artifacts can be reduced by increasing -tv_weight, but at the expense of content/style reconstruction, and even then they are still visible.

I tried fixing it a few ways. Clamping the image between iterations (not just at the end) didn't fix it. I tried playing with the TVLoss module. For example, changing

self.loss = self.strength * (torch.sum(torch.abs(self.x_diff)) + torch.sum(torch.abs(self.y_diff)))

to an L2-loss, i.e.

self.loss = self.strength * (torch.sum(torch.pow(self.x_diff, 2)) + torch.sum(torch.pow(self.y_diff, 2)))

also did not get rid of the artifacts (I tried this because my reading of the TV loss formula is that it uses an L2 penalty rather than absolute values, though I'm not sure this makes a big difference).
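
For reference, here is a minimal sketch of what a TV loss module looks like with both variants side by side. The mode flag is illustrative, not an actual neural-style-pt option, and the diff definitions follow the usual horizontal/vertical neighbor-difference convention:

import torch
import torch.nn as nn

class TVLossSketch(nn.Module):
    # Illustrative sketch only: pass-through module that records a total
    # variation penalty over neighboring pixel differences, with an L1 or
    # L2 reduction selectable via 'mode'.
    def __init__(self, strength, mode='l1'):
        super(TVLossSketch, self).__init__()
        self.strength = strength
        self.mode = mode

    def forward(self, input):
        self.x_diff = input[:, :, 1:, :] - input[:, :, :-1, :]
        self.y_diff = input[:, :, :, 1:] - input[:, :, :, :-1]
        if self.mode == 'l1':
            self.loss = self.strength * (self.x_diff.abs().sum() + self.y_diff.abs().sum())
        else:  # 'l2'
            self.loss = self.strength * (self.x_diff.pow(2).sum() + self.y_diff.pow(2).sum())
        return input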

The artifact is very subtle, but I'm hoping to fix it, as I'd like to produce print-quality images in the future, and multi-stage or multi-scale techniques built on top may amplify it. I wonder if you have any idea what might be causing this or what could potentially fix it.

ProGamerGov commented 4 years ago

@genekogan It's not immediately clear to me what is causing these artifacts, but I'll look into it and see if I can figure it out. Have you tested any of the other models? And have you ruled out artifacts from the input images (like JPEG artifacts)?

genekogan commented 4 years ago

Just tried with VGG-16. At first glance it wasn't as noticeable, but my eye isn't very trained to it yet, and I can't compare to the Lua version since I don't see a VGG-16 model available for it.

As for JPEG artifacts, that seems unlikely, because the artifacts appear in different places each time I run it, and I'm using the exact same images for both the Lua and PyTorch versions.

ProGamerGov commented 4 years ago

The VGG-16 model for the Lua version can be found here: https://gist.github.com/ksimonyan/211839e770f7b538e2d8.

A list of many of the supported models for the Lua version can be found here: https://github.com/jcjohnson/neural-style/wiki/Using-Other-Neural-Models

The full list of models supported by neural-style-pt can be found here: https://github.com/ProGamerGov/neural-style-pt/wiki/Other-Models

ProGamerGov commented 4 years ago

Maybe there are layer implementation differences between Torch and PyTorch that could be causing the artifacts? Have you tried using resize convolutions (which are designed to deal with checkerboard artifacts)?

genekogan commented 4 years ago

I took a quick stab at this, by replacing at line 119:

if isinstance(layer, nn.Conv2d):
    net.add_module(str(len(net)), layer)

with

if isinstance(layer, nn.Conv2d):
    net.add_module(str(len(net)), nn.Upsample(scale_factor=2, mode='bilinear'))
    net.add_module(str(len(net)), layer)
    net.add_module(str(len(net)), nn.MaxPool2d(2))

and then doubling -style_scale to compensate for the 2x upsampling. It doesn't solve the checkerboard problem for me, and it's now also slower and more memory-intensive. I made this change based on the discussion/advice here, but I'm not sure I made the right change or understood it correctly. I need to read a bit deeper and try again, but I wanted to record early attempts here.

ajhool commented 3 years ago

@genekogan ever have any success with this? One area I had been looking at was the deprocess code, which might differ slightly in how the normalization is done. I thought the clamp might be doing it:

# Preprocess an image before passing it to a model.
# We need to rescale from [0, 1] to [0, 255], convert from RGB to BGR,
# and subtract the mean pixel.
def preprocess(image_name, image_size):
    image = Image.open(image_name).convert('RGB')
    if type(image_size) is not tuple:
        image_size = tuple([int((float(image_size) / max(image.size))*x) for x in (image.height, image.width)])
    Loader = transforms.Compose([transforms.Resize(image_size), transforms.ToTensor()])
    rgb2bgr = transforms.Compose([transforms.Lambda(lambda x: x[torch.LongTensor([2,1,0])])])
    Normalize = transforms.Compose([transforms.Normalize(mean=[103.939, 116.779, 123.68], std=[1,1,1])])
    tensor = Normalize(rgb2bgr(Loader(image) * 256)).unsqueeze(0)
    return tensor

#  Undo the above preprocessing.
def deprocess(output_tensor):
    Normalize = transforms.Compose([transforms.Normalize(mean=[-103.939, -116.779, -123.68], std=[1,1,1])])
    bgr2rgb = transforms.Compose([transforms.Lambda(lambda x: x[torch.LongTensor([2,1,0])])])
    output_tensor = bgr2rgb(Normalize(output_tensor.squeeze(0).cpu())) / 256
    output_tensor.clamp_(0, 1)
    Image2PIL = transforms.ToPILImage()
    image = Image2PIL(output_tensor.cpu())
    return image

The original Lua version, for comparison:

-- Preprocess an image before passing it to a Caffe model.
-- We need to rescale from [0, 1] to [0, 255], convert from RGB to BGR,
-- and subtract the mean pixel.
function preprocess(img)
  local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
  local perm = torch.LongTensor{3, 2, 1}
  img = img:index(1, perm):mul(256.0)
  mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
  img:add(-1, mean_pixel)
  return img
end

-- Undo the above preprocessing.
function deprocess(img)
  local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
  mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
  img = img + mean_pixel
  local perm = torch.LongTensor{3, 2, 1}
  img = img:index(1, perm):div(256.0)
  return img
end

genekogan commented 3 years ago

@ajhool I'm not sure deprocess has much to do with it, because it's only called before displaying or saving the image to disk. The original tensor goes into the next iteration unclamped. Have you tried any changes?

ProGamerGov commented 3 years ago

@genekogan Can you still reproduce this issue on the latest version of PyTorch? I was trying to see if my gradient normalization code fixed it, but I can't even get the original artifacts to show up like they did before.

genekogan commented 3 years ago

Just tried it, with pytorch 1.7.

python neural_style.py -content_image examples/inputs/monalisa.jpg -style_image examples/inputs/hokusai.jpg -image_size 720 -num_iterations 1000 -backend cudnn

content, style, and output:

[content image: monalisa.jpg; style image: hokusai.jpg]

[output image]

Same thing but with -normalize_gradients.

[output with -normalize_gradients]

I'm still getting the checkerboard artifacts but normalizing the gradients seems to reduce those grayish de-saturated regions, which is very nice. I haven't found anything yet for the checkerboard artifacts.

genekogan commented 3 years ago

Another one, same as above but with starry_night.jpg as the style image:

[output image]

and -normalize_gradients:

[output with -normalize_gradients]

The artifacts are much less visible in this case. Perhaps it's image-sensitive? Or maybe I used a low-quality image before and it learned compression artifacts? Not sure, but here it's looking much better.

ProGamerGov commented 3 years ago

@genekogan I was testing with the brad_pitt.jpg example image and hokusai:

[output image]

Tests with and without gradient normalization (for some control tests I set the weights to (strength * strength) to make them comparable to the normalized-gradient runs).

There are artifacts in my test results, but they are nowhere near as bad as yours with the Mona Lisa.

In the original neural-style, I found that Adam produced grayish regions when its parameters, like the betas, were not optimal. PyTorch's Adam optimizer actually uses the same parameters that I figured out for neural_style.lua. Maybe L-BFGS needs its parameters tuned in order to avoid creating grayish areas?
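
If it helps narrow this down, here is a rough sketch of where those optimizer knobs live. The values shown are just PyTorch's defaults, not tuned recommendations, and img stands in for the image tensor being optimized:

import torch

# Placeholder image tensor standing in for the optimized image
img = torch.randn(1, 3, 512, 512, requires_grad=True)

# L-BFGS knobs that could be tuned (values are PyTorch defaults)
lbfgs = torch.optim.LBFGS(
    [img],
    lr=1.0,
    history_size=100,       # number of stored correction pairs
    tolerance_grad=1e-7,
    tolerance_change=1e-9,
)

# Adam knobs that could be tuned (values are PyTorch defaults)
adam = torch.optim.Adam([img], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)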

ProGamerGov commented 3 years ago

Another thing is that neural_style.lua appears to multiply the gradients by the content/style weights in the backward pass instead of dividing by them. So in order to reproduce that, I had to multiply the weights by themselves to cancel out the division by the weights that takes place in neural-style-pt's backward pass (because things are reversed in the backward pass). If I didn't multiply the normalized gradients in the backward pass at all, the stylization was little to nonexistent. I may be wrong about this, but testing with it gave me very similar results to neural_style.lua.

    @staticmethod
    def backward(self, grad_output):
        grad_input = grad_output.clone()
        grad_input = grad_input / (torch.norm(grad_input, keepdim=True) + 1e-8)
        return grad_input * self.strength * self.strength, None

class ContentLoss(nn.Module):

    def forward(self, input):
        if self.mode == 'loss':
            loss = self.crit(input, self.target)
            if self.normalize:
                loss = ScaleGradients.apply(loss, self.strength)
            self.loss = loss * self.strength

https://github.com/ProGamerGov/neural-style-pt/blob/master/neural_style.py#L414

And from neural_style.lua:

function ContentLoss:updateGradInput(input, gradOutput)
  if self.mode == 'loss' then
    if input:nElement() == self.target:nElement() then
      self.gradInput = self.crit:backward(input, self.target)
    end
    if self.normalize then
      self.gradInput:div(torch.norm(self.gradInput, 1) + 1e-8)
    end
    self.gradInput:mul(self.strength)

https://github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L480

Hopefully now that I've figured out how to modify things in the backward pass without breaking neural-style-pt, we can get closer to a solution to the artifacts issue! More information about PyTorch autograd functions (what ScaleGradients uses) can be found here.
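
For anyone following along, here is roughly what the complete autograd Function looks like. The forward shown is my reconstruction of a pass-through that stashes the strength, so check the repo link above for the authoritative version:

import torch

class ScaleGradients(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_tensor, strength):
        # Assumed pass-through forward that just records the strength
        ctx.strength = strength
        return input_tensor

    @staticmethod
    def backward(ctx, grad_output):
        # Normalize the incoming gradient, then scale by strength^2 so the
        # later division by the weight cancels out, as described above
        grad_input = grad_output.clone()
        grad_input = grad_input / (torch.norm(grad_input, keepdim=True) + 1e-8)
        return grad_input * ctx.strength * ctx.strength, None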

genekogan commented 3 years ago

This is what I get using default parameters (regular first, then with -normalize_gradients).

[outputs: default settings, and with -normalize_gradients]

The artifacts are still there, though they're fairly mild (and even milder with normalized gradients). Your outputs seem to have no artifacts, though. I'm a bit puzzled why we're getting different results. Are you using Adam, or the default settings for L-BFGS? I'm on the latest version of your master branch and PyTorch 1.7. Maybe something lower-level is causing the disparity?

@alexjc wrote a bit earlier about the issue of muddy regions in this thread and this one. The paper by Heitz et al. says the Gram loss fails to represent the whole style feature distribution, and suggests that swapping in a Sliced Wasserstein Distance may help reduce or eliminate the muddy, desaturated zones; I also suspect it could improve the issue I was struggling with earlier of blending styles together. I tried to implement SWD the other day but ran into issues. I recall that swapping in a histogram loss helped a bit but did not solve it. I'm curious whether this looks interesting to you.
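
As a starting point, something like the following is how I understand a sliced Wasserstein style loss would look. This is a hedged sketch, not code from this repo or the paper; it assumes both feature maps have the same number of spatial positions, otherwise the sorted projections would need to be resampled to a common length:

import torch

def sliced_wasserstein_loss(feat_x, feat_y, n_proj=64):
    # feat_x, feat_y: (C, N) feature matrices from the same VGG layer, where
    # N is the number of spatial positions. Project onto random unit
    # directions, sort each 1-D marginal, and compare them.
    c = feat_x.size(0)
    proj = torch.randn(c, n_proj, device=feat_x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)     # random unit directions
    proj_x = (feat_x.t() @ proj).sort(dim=0)[0]      # sorted marginals, (N, n_proj)
    proj_y = (feat_y.t() @ proj).sort(dim=0)[0]
    return (proj_x - proj_y).pow(2).mean()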

ProGamerGov commented 3 years ago

@genekogan These are the different parameters that I used for testing with both Adam and L-BFGS, I think (the tests above are all with L-BFGS, and master branch with no changes):

-init image -style_weight 6000 -content_weight 100

I think for the tests above I may have also forgotten to set -tv_weight to zero, which I normally do. In the past I have also used a learning rate of 2 with the Adam optimizer in the original neural-style, but I'm not sure I ever did the same with neural-style-pt.

ProGamerGov commented 3 years ago

Are these the same artifacts?

[cropped output images]

Because spatial decorrelation (FFT parameterization) and color decorrelation seem to make them worse, and transform robustness just ends up creating outputs that aren't stylized in the way we're looking for.

ProGamerGov commented 3 years ago

@genekogan I have implemented spatial decorrelation, color decorrelation, and transform robustness in neural-style-pt here to see if they could help resolve the artifacts, but they appear to make them much worse, as you can see above: https://gist.github.com/ProGamerGov/7294364e7e58d239fb1a8c0ae8a0957e

The main area in the script where you can adjust parameters is here. Be warned that there appear to be some bugs related to the size of your chosen content image and the FFT class. I still have to fix those for Captum and my other projects.

The code is based on dream-creator and my work on Captum.

There's a lot to experiment with, and you'll have to comment out / uncomment things to add or remove them for testing, as the new features don't have argparse parameters. A good starting point might be Lucid's style transfer notebook: https://colab.research.google.com/github/tensorflow/lucid/blob/master/notebooks/differentiable-parameterizations/style_transfer_2d.ipynb

genekogan commented 3 years ago

@ProGamerGov The artifacts in your images above look different, more like the kind you get when TV regularization is set too low. The ones in the initial post are more sporadic and spaced out.

The artifacts seem to be reduced when the style weight is increased relative to the content weight, but I haven't tested that thoroughly. I will try out the links.

ProGamerGov commented 3 years ago

So, the Mona Lisa image uses the sRGB IEC61966-2.1 color profile, and when PIL loads it and converts it to an RGB image there's a slight change in the colors. But that doesn't explain the artifacts, or why changing the weights influences their prevalence. I also think that the artifacts in your images look almost like ISO noise, and that they may have become somewhat aligned by the optimization process.
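
To rule out the profile conversion entirely, one option might be converting any embedded ICC profile to plain sRGB before the image reaches preprocess(). A rough sketch of such a helper (the function name and fallback behavior are my own choices, not part of neural-style-pt):

import io
from PIL import Image, ImageCms

def load_as_srgb(path):
    # Convert an embedded ICC profile (e.g. sRGB IEC61966-2.1) to plain sRGB
    # before the usual RGB conversion, so PIL's profile handling can be ruled
    # out as a source of color shifts.
    img = Image.open(path)
    icc = img.info.get('icc_profile')
    if icc:
        src = ImageCms.ImageCmsProfile(io.BytesIO(icc))
        dst = ImageCms.createProfile('sRGB')
        img = ImageCms.profileToProfile(img, src, dst, outputMode='RGB')
    return img.convert('RGB')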

ProGamerGov commented 3 years ago

@genekogan Since your artifacts don't seem to resemble the checkerboard artifacts from conv layers, maybe the issue comes from the pooling layers? Figure 3 of this paper (https://arxiv.org/abs/1511.06394) seems to show examples that look like your artifacts.

[VGG pooling artifact examples from the paper]

If we wanted to test it, we'd have to replace the pooling layers with L2 pooling layers. But I'm not sure how to turn the equation they give, L2(x) = √(g ∗ x²), into a PyTorch class.

where the squaring and square-root operations are point-wise, and the blurring kernel g(.) is chosen as a 6×6 pixel Hanning window that approximately enforces the Nyquist criterion. This type of pooling is often used to describe the behavior of neurons in primary visual cortex (Vintch et al).
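
One possible way to express that as a PyTorch module, sketched under the assumption that a 6×6 Hann window followed by a stride-2 convolution is an acceptable stand-in for the paper's pooling. The class name L2Pool2d and the padding are my own untested choices, picked so the output size matches MaxPool2d(2, 2):

import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Pool2d(nn.Module):
    # Sketch of L2 pooling, L2(x) = sqrt(g * x^2), with g a fixed 6x6
    # Hann-window blur applied per channel (depthwise convolution).
    def __init__(self, kernel_size=6, stride=2):
        super().__init__()
        self.stride = stride
        win = torch.hann_window(kernel_size, periodic=False)
        g = win.unsqueeze(1) * win.unsqueeze(0)
        g = g / g.sum()
        self.register_buffer('g', g.view(1, 1, kernel_size, kernel_size))

    def forward(self, x):
        weight = self.g.repeat(x.size(1), 1, 1, 1)   # one blur kernel per channel
        x = F.pad(x, (2, 2, 2, 2), mode='reflect')   # padding chosen for the 6x6 / stride-2 case
        pooled = F.conv2d(x.pow(2), weight, stride=self.stride, groups=x.size(1))
        return (pooled + 1e-8).sqrt()                # small epsilon for gradient stability at zero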

ProGamerGov commented 3 years ago

So, I can implement my own version of MaxPool2d in one of two ways:

class MaxPool2d(torch.nn.MaxPool2d):
    def forward(self, x):
        x = x.unfold(2, self.kernel_size, self.stride).unfold(3, self.kernel_size, self.stride)
        x = x.contiguous().view(x.size()[:4] + (-1,))
        pool, indices = torch.max(x, dim=-1)
        return pool

Or:

import torch
import torch.nn.functional as F
from torch.nn.modules.utils import _pair, _quadruple
class MaxPool2d(torch.nn.Module):

    def __init__(self, kernel_size=2, stride=2, padding=0):
        super(MaxPool2d, self).__init__()
        self.kernel_size = _pair(kernel_size)
        self.stride = _pair(stride)
        self.padding = _quadruple(padding)

    def forward(self, x):
        x = F.pad(x, self.padding, mode='reflect')
        x = x.unfold(2, self.kernel_size[0], self.stride[0]).unfold(3, self.kernel_size[1], self.stride[1])
        x = x.contiguous().view(x.size()[:4] + (-1,))
        pool, indices = torch.max(x, dim=-1)
        return pool

But I'm not sure how to do the L2 pooling, or how to apply the blur. Blurring with Conv2d seems to alter the size of the input.

Edit: I have blurring set up for a MaxPool2d layer here:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianBlur(nn.Module):

    def __init__(self, kernel_size=6, sigma = math.pi / 2):
        super().__init__()
        if type(kernel_size) is not list and type(kernel_size) is not tuple:
            kernel_size = [kernel_size] * 2
        if type(sigma) is not list and type(sigma) is not tuple:
            sigma = [sigma] * 2

        kernel = 1
        meshgrid_tensor = torch.meshgrid([torch.arange(size, dtype=torch.float32) for size in kernel_size])

        for size, std, mgrid in zip(kernel_size, sigma, meshgrid_tensor):
            kernel *= 1 / (std * math.sqrt(2 * math.pi)) * \
            torch.exp(-((mgrid - ((size - 1) / 2)) / std) ** 2 / 2)
        self.kernel = (kernel / torch.sum(kernel)).view(1, 1, *kernel.size()).cuda()

    def forward(self, x):
        assert x.dim() == 4
        groups = x.size(1)
        weight = self.kernel.repeat(groups, * [1] * (self.kernel.dim() - 1))
        x = F.pad(x, (3,2,3,2), mode='reflect') # No idea if this is a good idea for keeping input the same size
        x = F.conv2d(x, weight=weight, groups=groups)
        return x

blur_input = GaussianBlur(6, sigma = 0.25)
class MaxPool2d(torch.nn.MaxPool2d):
    def forward(self, x):
        x = blur_input(x)
        x = x.unfold(2, self.kernel_size, self.stride).unfold(3, self.kernel_size, self.stride)
        x = x.contiguous().view(x.size()[:4] + (-1,))
        pool, _ = torch.max(x, dim=-1)
        return pool

maxpool2d_blurred_layer = MaxPool2d(kernel_size=2, stride=2)
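
If anyone wants to try this, something along these lines in neural_style.py's model-assembly loop should swap the blurred pooling in for the stock layers. Treat it as an untested sketch of the idea, not a drop-in patch:

# Hypothetical: inside the loop that copies layers from the pretrained VGG
# into `net`, substitute the blurred max-pool defined above for each
# stock pooling layer.
if isinstance(layer, torch.nn.MaxPool2d):
    net.add_module(str(len(net)), maxpool2d_blurred_layer)
else:
    net.add_module(str(len(net)), layer)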
genekogan commented 3 years ago

Cool! I tried this out -- the results are very striking. I'm still getting artifacts, but they are almost gone with the normalized gradients option. The features also look much sharper and more saturated, and the muddy regions are also reduced, especially with the tv regularization turned off.

Regular, default parameters with blurred maxpool2d:

[output image]

With normalized gradients:

[output with -normalize_gradients]

With tv_weight = 0 (un-normalized):

[output with -tv_weight 0]

With normalized gradients and tv_weight = 0:

[output with -normalize_gradients and -tv_weight 0]

genekogan commented 3 years ago

One hacky idea that could help balance the trade-off between checkerboard artifacts/high-frequency noise (which seem to decrease with increased TV regularization) and muddy regions (which decrease with decreased TV regularization) would be to modify the TV loss by first multiplying it element-wise with a saturation map (or something similar) of the image before summing it all together.

So instead of:

 self.loss = self.strength * (torch.sum(torch.abs(self.x_diff)) + torch.sum(torch.abs(self.y_diff)))

Something like:

xd = torch.mul(S, torch.abs(self.x_diff))
yd = torch.mul(S, torch.abs(self.y_diff))
self.loss = self.strength * (torch.sum(xd) + torch.sum(yd))

where S is some measure of the local saturation around each pixel (or rather desaturation, since you want to increase the TV penalty in the desaturated regions). For example, S(x, y) could be the inverse of the standard deviation in an m×n block centered at (x, y).
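
A rough sketch of one way S could be computed, treating local standard deviation as a crude stand-in for saturation (the helper name, block size, and epsilon are just illustrative):

import torch
import torch.nn.functional as F

def desaturation_map(img, block=5, eps=1e-3):
    # Larger values where the image is locally flat/desaturated: inverse of
    # the local standard deviation in a block x block neighborhood, averaged
    # over channels. S would still need to be cropped to the shapes of
    # x_diff and y_diff before the element-wise multiply.
    mean = F.avg_pool2d(img, block, stride=1, padding=block // 2)
    sq_mean = F.avg_pool2d(img * img, block, stride=1, padding=block // 2)
    local_std = (sq_mean - mean * mean).clamp(min=0).sqrt().mean(dim=1, keepdim=True)
    return 1.0 / (local_std + eps)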

Or, even simpler, use an L2 sum instead of L1, i.e. raise each element of torch.abs(self.x_diff) and torch.abs(self.y_diff) to some power to penalize large differences between neighboring pixels.

I'm not sure there's much to gain from trying to optimize the TV noise, since the effect is pretty subtle, and maybe the artifacts aren't even my biggest problem anymore, but it's some untrained/half-baked food for thought.

Sankyuubigan commented 3 years ago

Hi guys. I'm not very familiar with this code. Could you please provide a neural_style.py file with the maxpool2d changes or genekogan's TV optimization already applied? How do I use this code?