crowsonkb / k-diffusion

Karras et al. (2022) diffusion models for PyTorch
MIT License

Use more standard fid/kid calculation #8

Closed · samedii closed this 2 years ago

samedii commented 2 years ago


Thanks for providing this very well-written code; it has been very nice to read and compare :D

crowsonkb commented 2 years ago

> Had to use cleanfid's numpy calculation for FID since the results were quite different

How different, by the way? (Does mine have a bug, or is it down to fp32 vs. fp64, or the epsilon?)

We need to use the dataloader to load images instead of loading them from a directory, because we may not have a directory: I am going to add webdataset support later, and for that we just have an iterator, not a directory. Also, we need to compute the features in parallel for speed; does your code do that?
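
Roughly, batched feature extraction over any iterator looks like this (a minimal sketch with hypothetical names, not the k-diffusion code):

import torch

@torch.no_grad()
def compute_features(batches, extractor, device='cuda'):
    # `batches` is any iterable of NCHW image tensors (a DataLoader, a
    # webdataset iterator, ...), and `extractor` maps an NCHW batch to NxD
    # features (e.g. an InceptionV3 wrapper). No directory on disk is needed,
    # and each batch is processed on the GPU.
    feats = []
    for x in batches:
        if isinstance(x, (list, tuple)):  # e.g. (images, labels) from a DataLoader
            x = x[0]
        feats.append(extractor(x.to(device)).cpu())
    return torch.cat(feats)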

crowsonkb commented 2 years ago

My KID is less stochastic because I don't partition the data into subsets to compare pairwise yet. KID for 50,000 reals vs. 50,000 fakes would require materializing a 50,000x50,000 kernel matrix, which is kind of big, so people usually take random subsets, compare them pairwise, and average the results.
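
The subset trick is roughly this (a sketch of the standard approach, not code from either library):

import torch

def polynomial_kernel(x, y):
    # Degree-3 polynomial kernel commonly used for KID.
    d = x.shape[-1]
    return (x @ y.transpose(-2, -1) / d + 1) ** 3

def mmd2(x, y):
    # Unbiased MMD^2 estimator between two feature sets.
    m, n = x.shape[0], y.shape[0]
    k_xx, k_yy, k_xy = polynomial_kernel(x, x), polynomial_kernel(y, y), polynomial_kernel(x, y)
    term_xx = (k_xx.sum() - k_xx.diagonal().sum()) / (m * (m - 1))
    term_yy = (k_yy.sum() - k_yy.diagonal().sum()) / (n * (n - 1))
    return term_xx + term_yy - 2 * k_xy.mean()

def kid_subsets(feats_real, feats_fake, subset_size=1000, n_subsets=100):
    # Average MMD^2 over random pairwise subsets so a full 50,000 x 50,000
    # kernel matrix never has to be materialized.
    vals = []
    for _ in range(n_subsets):
        i = torch.randperm(feats_real.shape[0])[:subset_size]
        j = torch.randperm(feats_fake.shape[0])[:subset_size]
        vals.append(mmd2(feats_real[i], feats_fake[j]))
    return torch.stack(vals).mean()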

crowsonkb commented 2 years ago

As for FID, I use a different formula for the Wasserstein-2 distance between multivariate Gaussian distributions, given at https://djalil.chafai.net/blog/2010/04/30/wasserstein-distance-between-two-gaussians/, in order to explicitly handle the case where the covariance matrices do not commute (sigma_1 sigma_2 != sigma_2 sigma_1), since I have seen this in practice; the usual formula is only valid when they commute. I don't know if this is the source of the difference; maybe I just have a bug somewhere else, or it is because I am doing the calculation in fp32. :)
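
For reference, that formula is ||m_1 - m_2||^2 + tr(sigma_1 + sigma_2 - 2 (sigma_1^(1/2) sigma_2 sigma_1^(1/2))^(1/2)), which only ever takes the square root of symmetric PSD matrices. A minimal NumPy/SciPy sketch (not the k-diffusion implementation):

import numpy as np
import scipy.linalg

def fid_symmetrized(mean1, cov1, mean2, cov2):
    # Frechet / Wasserstein-2 distance via the symmetrized covariance term.
    mean_term = np.sum((mean1 - mean2) ** 2)
    sqrt_cov1 = scipy.linalg.sqrtm(cov1).real
    cross = scipy.linalg.sqrtm(sqrt_cov1 @ cov2 @ sqrt_cov1).real
    return mean_term + np.trace(cov1) + np.trace(cov2) - 2 * np.trace(cross)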

crowsonkb commented 2 years ago

> in order to explicitly handle the case where the covariance matrices do not commute (sigma_1 sigma_2 != sigma_2 sigma_1), since I have seen this in practice

Apparently their formula does work; you just have to use a sqrtm routine that handles non-symmetric matrices. Since sqrtm_eig() only works on SPD matrices, I had to use the other formula instead. I can try a different sqrtm if that's the problem.
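
For comparison, the usual formula with such a routine might look like this (a sketch; scipy.linalg.sqrtm uses a Schur decomposition, so it handles the non-symmetric product directly, and the reference implementations also add a small epsilon to the diagonal when the result is not finite):

import numpy as np
import scipy.linalg

def fid_usual(mean1, cov1, mean2, cov2):
    # The "usual" FID formula; cov1 @ cov2 is generally non-symmetric, which
    # scipy.linalg.sqrtm handles (the result can be complex, so take .real).
    covmean = scipy.linalg.sqrtm(cov1 @ cov2).real
    diff = mean1 - mean2
    return diff @ diff + np.trace(cov1 + cov2 - 2 * covmean)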

crowsonkb commented 2 years ago

I think I fixed sqrtm_eig() to work on non-symmetric matrices that have real eigenvalues and eigenvectors, but I need to test it, and I'm still not sure whether that's the source of the difference.

import torch


class _MatrixSquareRootEig(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a):
        # Eigendecomposition-based square root. Keeping only the real parts
        # assumes the input has real eigenvalues and eigenvectors.
        vals, vecs = torch.linalg.eig(a)
        vals, vecs = vals.real, vecs.real
        ctx.save_for_backward(vals, vecs)
        return vecs @ vals.abs().sqrt().diag_embed() @ vecs.transpose(-2, -1)

    @staticmethod
    def backward(ctx, grad_output):
        # Propagate the gradient through the eigenbasis; d + d^T holds the
        # pairwise sums sqrt(|val_i|) + sqrt(|val_j|).
        vals, vecs = ctx.saved_tensors
        d = vals.abs().sqrt().unsqueeze(-1).repeat_interleave(vals.shape[-1], -1)
        vecs_t = vecs.transpose(-2, -1)
        return vecs @ (vecs_t @ grad_output @ vecs / (d + d.transpose(-2, -1))) @ vecs_t


def sqrtm_eig(a):
    if a.ndim < 2:
        raise RuntimeError('tensor of matrices must have at least 2 dimensions')
    if a.shape[-2] != a.shape[-1]:
        raise RuntimeError('tensor must be batches of square matrices')
    return _MatrixSquareRootEig.apply(a)
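
A quick sanity check of the symmetric case (hypothetical, not from the repo; the non-symmetric case is the part that still needs testing):

a = torch.randn(8, 8, dtype=torch.float64)
a = a @ a.T + torch.eye(8, dtype=torch.float64)  # symmetric positive definite
root = sqrtm_eig(a)
print(torch.allclose(root @ root, a, atol=1e-6))  # expect True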

crowsonkb commented 2 years ago

One advantage of resize_right is that it is clean, good, and works on the GPU; PIL's floating-point resize happens on the CPU, and you essentially have to use a thread pool to do a lot of resizes in parallel, which is a pain (I have done this before). I am not concerned with it matching clean-fid 100%, but it should be close.

crowsonkb commented 2 years ago

> Resize is using cleanfid's PIL resizing. resize_right was relatively close (atol=1e-3) but this still changes FID too much

As I commented elsewhere, you need to clamp the resize_right output: it leaves the overshoot from bicubic etc. in by default, because clamping makes the gradient with respect to the image zero for out-of-range values and the op is intended to be differentiable. :)
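
For example (a sketch assuming images in the [-1, 1] range):

from resize_right import resize

def resize_for_fid(x, size=(299, 299)):
    # Bicubic-style kernels overshoot the input range; resize_right leaves the
    # overshoot in to stay differentiable, so clamp it back afterwards.
    return resize(x, out_shape=size, pad_mode='reflect').clamp(-1, 1)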

crowsonkb commented 2 years ago

What FID are you supposed to get on an untrained 32x32 model (samples are noise) btw?

samedii commented 2 years ago

> As I commented elsewhere, you need to clamp the resize_right output: it leaves the overshoot from bicubic etc. in by default, because clamping makes the gradient with respect to the image zero for out-of-range values and the op is intended to be differentiable. :)

The closest results I have gotten are with clamping :) It passes atol=1e-3 when the images are scaled to 0-255 so it's quite close. https://github.com/samedii/nicefid/blob/master/nicefid/resize.py

I tested resize_right again and I'm getting FID 310.2350871755488 (reference) vs 310.43759267144617 (resize_right) on my very small test dataset, so that could be fine.

The features are significantly different though: [0.1625934, 0.09377331, 0.14070226, ...] vs [0.1617, 0.0937, 0.1421, ...]

I think it could be viable to switch to resize_right but I would want to test on another dataset too.

> We need to use the dataloader to load images instead of loading them from a directory, because we may not have a directory: I am going to add webdataset support later, and for that we just have an iterator, not a directory. Also, we need to compute the features in parallel for speed; does your code do that?

The library allows you to pass in an iterator that yields batches of NCHW tensors, so you don't need to save the images to a directory first: Features.from_iterator(...). The features are calculated one batch at a time. Is that sufficient?
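
Roughly (a sketch; only Features.from_iterator is the actual entry point named above, the loader plumbing and variable names here are just placeholders):

import nicefid

def image_batches(dataloader):
    # Any iterator that yields NCHW tensors works; no directory on disk needed.
    for images, _ in dataloader:
        yield images

features_real = nicefid.Features.from_iterator(image_batches(real_loader))
features_fake = nicefid.Features.from_iterator(image_batches(fake_loader))
# The two Features objects are then compared with the library's FID/KID
# functions (see the nicefid README for the exact names).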

I also added support for passing in the model manually, because I saw you wrapping your models with accelerator (and we don't want the lib to do that by default), but I'm not sure if that's actually needed. I'm not familiar with accelerator.

> As for FID, I use a different formula for the Wasserstein-2 distance between multivariate Gaussian distributions, given at https://djalil.chafai.net/blog/2010/04/30/wasserstein-distance-between-two-gaussians/, in order to explicitly handle the case where the covariance matrices do not commute (sigma_1 sigma_2 != sigma_2 sigma_1), since I have seen this in practice; the usual formula is only valid when they commute. I don't know if this is the source of the difference; maybe I just have a bug somewhere else, or it is because I am doing the calculation in fp32. :)

I don't think I tried using double there, but I'll keep it in mind. The difference is currently too large for that to be the cause: it is FID 307 vs 310.

> I think I fixed sqrtm_eig() to work on non-symmetric matrices that have real eigenvalues and eigenvectors, but I need to test it, and I'm still not sure whether that's the source of the difference.

I started trying another implementation of sqrtm but gave up. I think it is part of the issue. I changed eps and it didn't help.

I can have a go with your new implementation and see how it looks.

crowsonkb commented 2 years ago

> I tested resize_right again and I'm getting FID 310.2350871755488 (reference) vs 310.43759267144617 (resize_right) on my very small test dataset, so that could be fine.

How are you getting that? I tried picking the CleanFID feature extractor out and I am getting FID values like 18 or 43, and KID values that indicate no difference between random-noise fakes and reals...!

crowsonkb commented 2 years ago

from cleanfid.inception_pytorch import InceptionV3
from resize_right import resize
from torch import nn


class InceptionV3FeatureExtractor(nn.Module):
    def __init__(self, device='cpu'):
        super().__init__()
        self.model = InceptionV3(resize_input=False, normalize_input=False).to(device)
        self.size = (299, 299)

    def forward(self, x):
        # Resize with resize_right, clamp the overshoot, then map [-1, 1] to [0, 255].
        if x.shape[2:4] != self.size:
            x = resize(x, out_shape=self.size, pad_mode='reflect').clamp(-1, 1)
        return self.model(x * 127.5 + 127.5)[0][:, :, 0, 0]

Something's really wrong with this feature extractor and I can't figure out what. :/

samedii commented 2 years ago

Updates:

> Something's really wrong with this feature extractor and I can't figure out what. :/

You are looking at the wrong model though. They are using InceptionV3W in "clean" mode.

> How are you getting that? I tried picking the CleanFID feature extractor out and I am getting FID values like 18 or 43, and KID values that indicate no difference between random-noise fakes and reals...!

Did you see that their model expects values between 0-255?

> What FID are you supposed to get on an untrained 32x32 model (samples are noise) btw?

Not sure but I've seen values very early on in training above 300.

crowsonkb commented 2 years ago

Ah, so:

import os
from pathlib import Path

from cleanfid.inception_torchscript import InceptionV3W
from resize_right import resize
from torch import nn


class InceptionV3FeatureExtractor(nn.Module):
    def __init__(self, device='cpu'):
        super().__init__()
        # Cache the InceptionV3W weights under XDG_CACHE_HOME (or ~/.cache).
        path = Path(os.environ.get('XDG_CACHE_HOME', Path.home() / '.cache')) / 'k-diffusion'
        path.mkdir(parents=True, exist_ok=True)
        self.model = InceptionV3W(str(path), resize_inside=False).to(device)
        self.size = (299, 299)

    def forward(self, x):
        # Resize to 299x299 with resize_right, then map [-1, 1] to the 0-255
        # range the model expects, clamping the resize overshoot.
        if x.shape[2:4] != self.size:
            x = resize(x, out_shape=self.size, pad_mode='reflect')
        x = (x * 127.5 + 127.5).clamp(0, 255)
        return self.model(x)

?

samedii commented 2 years ago

Not quite, see here https://github.com/GaParmar/clean-fid/blob/main/cleanfid/inception_torchscript.py

crowsonkb commented 2 years ago

Do you mean the normalization difference? 127.5 is how you actually get from a [-1, 1] range image to a [0, 255] range image; I assume the 128 value in clean-fid is just wrong?
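
A quick check of the endpoints (just arithmetic):

for scale in (127.5, 128):
    # Map the endpoints of a [-1, 1] image into the 0-255 range the model expects.
    print(scale, -1 * scale + scale, 1 * scale + scale)
# 127.5: -1 -> 0.0 and 1 -> 255.0 (exactly spans 0-255)
# 128:   -1 -> 0.0 and 1 -> 256.0 (overshoots the top of the range)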

samedii commented 2 years ago

Ah, my bad; I just didn't see that you were wrapping the cleanfid implementation.

Edit: ah, and you're doing the resize inside. Clamping should be right then :)

crowsonkb commented 2 years ago

Thank you so much for your feedback/correctness checks on my clean-fid wrapper (and for motivating me to complete it!)

samedii commented 2 years ago

Happy to help in some way ^^