AlessioGalluccio / FastFlow

an implementation of the architecture of FastFlow (Jiawei Yu et al.)
MIT License

Q&A #14

Open mjack3 opened 2 years ago

mjack3 commented 2 years ago

Hello.

I would like to open this issue to talk about this project. I am also interested in developing it, and it would be great to share information, since the paper doesn't give deep detail about the implementation and the official code is not available.

If you agree with this initiative, we could first simplify the project to use Wide-ResNet50 in order to get results comparable with previous research. I would like to start from the beginning of the paper, where it says:

For ResNet, we directly use the features of the last layer in the first three blocks, and put these features into three corresponding FastFlow model.

This makes me think that in the implementation we need to use the features after the input layer, layer 1, and layer 2. In this way, Table 6 makes sense.

[image]

But I cannot see how to combine this information to make sense of the following:

In the forward process, it takes the feature map from the backbone network as input.

[image]

Depending on which part you read, it seems that either just one feature map or three are taken.
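
For concreteness, a minimal sketch of grabbing those three feature maps (this assumes timm's features_only API; the exact backbone wiring is my assumption, not something the paper specifies):

import timm
import torch

# hedged sketch: three feature maps (strides 4x, 8x, 16x) from Wide-ResNet50-2
fe = timm.create_model("wide_resnet50_2", pretrained=True,
                       features_only=True, out_indices=(1, 2, 3))
feats = fe(torch.rand(1, 3, 256, 256))
print([f.shape for f in feats])
# -> (1, 256, 64, 64), (1, 512, 32, 32), (1, 1024, 16, 16)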

AlessioGalluccio commented 2 years ago

Hi @mjack3, I'm really glad to find some help in this project. Thank you very much for your proposal, I accept. This paper is quite obscure. The problem you are addressing is explained in paragraph 4.7:

For ResNet18 and Wide-ResNet50-2, we directly use the features of the last layer in the first three blocks, put these features into the 2D flow model to obtain their respective anomaly detection and localization results, and finally take the average value as the final result.

I think that the paper wants us to build three different models and average their anomaly score. But how do we compute this anomaly score? This is the question that I can't solve. In the introduction we can find that:

We propose a 2D normalizing flow denoted as FastFlow for anomaly detection and localization with fully convolutional networks and two-dimensional loss function to effectively model global and local distribution.

But I can't find how this two-dimensional loss is defined. If you have an idea of a good two-dimensional loss for this problem, I'm all ears. Best, Alessio
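
Edit: for reference, a minimal sketch of the standard normalizing-flow NLL with a unit Gaussian prior, applied over a whole (C, H, W) feature map (whether this is exactly the paper's "two-dimensional" loss is the open question):

# hedged sketch: flow NLL over a feature map; log_jac_det is the flow's log|det J|
loss = torch.mean(0.5 * torch.sum(z ** 2, dim=(1, 2, 3)) - log_jac_det)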

mjack3 commented 2 years ago

Hmm, yes, you are right, we definitely need to create 3 FastFlow models. I will try. By the way, you can find my implementation here: https://github.com/mjack3/EasyFastFlow. Feel free to use whatever you want.

mjack3 commented 2 years ago

Have you tried contacting any of the main authors of the paper? I googled them but didn't find an e-mail.

Howeng98 commented 2 years ago

@mjack3 Hi, have you taken a look at CFLOW-AD? It is also implemented with a flow model; maybe it can help you understand how the 3 FastFlow modules work. I'm trying to implement FastFlow by modifying CFlow-AD. If you need any help or want to discuss, I would like to help (if I can).

mjack3 commented 2 years ago

@Howeng98 you are welcome =)

Yes, I also looked at the CFLOW-AD code, but I am not sure whether here we need to create 3 individual FastFlow models and train with 3 optimizers (one per FastFlow model) or do something similar to CFLOW-AD.

AlessioGalluccio commented 2 years ago

@mjack3 I tried to contact Yushuang Wu through a university e-mail I found, but I got no answer. I haven't found the e-mail of the other authors

mjack3 commented 2 years ago

When did you contact them, @AlessioGalluccio?

rafalfirlejczyk commented 2 years ago

Hi @mjack3, Can you please share your implementation of FastFlow? The link seems to be deactivated. Thanks

mjack3 commented 2 years ago

Currently I am obliged to keep the code private because of my job contract. I hope to open it soon. Anyway, I will share information in this same thread if needed :)

maaft commented 2 years ago

@AlessioGalluccio just a small remark: For anomaly score calculation (global and pixelwise) you need to use p(z) and not z which you are currently using.

you can estimate logp(z) (and therefore p(z)) analogously to the PyTorch implementation of CFlow-AD.
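
A minimal sketch of what I mean, assuming a unit Gaussian prior (the log|det J| term is a separate discussion):

import torch

_GCONST_ = -0.9189385332046727  # -0.5 * ln(2 * pi), the per-dimension Gaussian constant

def z_to_logpz(z: torch.Tensor) -> torch.Tensor:
    # per-pixel log-likelihood of z under N(0, I), averaged over channels
    return _GCONST_ - 0.5 * torch.mean(z ** 2, dim=1, keepdim=True)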

gathierry commented 2 years ago

Hi @maaft, did you manage to achieve results similar to the claimed ones? I tried both the CFlow way and the DifferNet way, but I'm still far below the performance reported in the paper.

Another source of confusion for me is that I cannot match the A.d. param # in the paper. I take each FlowStep to be one AllInOneBlock from FrEIA, with 2 convolution layers. This is my count (with the paper's count in parentheses):

CaiT:  7,043,780 (14.8M)
DeiT:  7,043,780 (14.8M)
Resnet18:  4,650,240 (4.9M)
WideResnet50:  41,309,184 (41.3M) -> this one is matched

Here's the code I used to compute the param #:

def count_params_per_flow_step(k, cin, ratio):
    # one coupling subnet: conv(k x k, cin -> cmed) followed by conv(k x k, cmed -> cout)
    cout = 2 * cin
    cmed = int(cin * ratio)
    w1 = k * k * cin * cmed
    b1 = cmed
    w2 = k * k * cmed * cout
    b2 = cout
    return w1 + w2 + b1 + b2

def count_total_params(num_steps, conv3x3_only, feature_channels, ratio):
    s = 0
    for channels in feature_channels:
        for i in range(num_steps):
            # alternate 3x3 and 1x1 kernels unless conv3x3_only is set
            k = 1 if (i % 2 == 1 and not conv3x3_only) else 3
            s += count_params_per_flow_step(k, channels // 2, ratio)
    return s

print("CaiT: ", count_total_params(20, False, [768], 0.16))
print("DeiT: ", count_total_params(20, False, [768], 0.16))
print("Resnet18: ", count_total_params(8, True, [64, 128, 256], 1.0))
print("WideResnet50: ", count_total_params(8, False, [256, 512, 1024], 1.0))

maaft commented 2 years ago

@gathierry no, I don't think I can match the scores in the paper (I haven't evaluated it properly yet, only visually). In particular, the transistor class (broken legs) does not learn at all.

I'll evaluate auroc etc next week and report back.

Also, I tried backbones other than resnet18 that achieve higher accuracy on ImageNet (e.g. EfficientNet) and noticed that training then has a very hard time converging at all. No idea why this is the case.

brm738 commented 2 years ago

My tests of this code (24 epochs) show acceptable results only for Resnet18, and only for three MVTec classes:

| Class | AUROC-MAX | AUCPR-MAX |
| -- | -- | -- |
| Bottle | 0.9849 | 0.9955 |
| Screw | 0.9859 | 0.9959 |
| Wood | 0.9956 | 0.9987 |

Other classes performed badly. I have not tested WideResnet50 yet. Feature extractors based on Vision Transformers like DeiT or CaiT do not learn at all.

maaft commented 2 years ago

That's weird, right? Do the resnet18 features follow some kind of special/nice distribution that the other architectures' features don't?

Has anyone tried different feature extractors with CFLOW-AD or other flow-based approaches?

maaft commented 2 years ago

I found out that resnet18 works well because the extracted features have a low magnitude. When I use e.g. EfficientNet and just scale the features by 0.1, the NF head seems to learn quite well.

I'll try adding a learnable scaling parameter to make my model backbone-agnostic. Edit: doesn't work, the features collapse to 0.
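
For clarity, the manual scaling is roughly just this (0.1 is hand-picked, nothing principled; self.fe is the frozen backbone):

with torch.no_grad():
    # hedged sketch: rescale backbone features before feeding the NF head
    features = [0.1 * f for f in self.fe(image)]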

maaft commented 2 years ago

Another thing: according to the architecture image (Fig. 2) in the paper, I think we should use RNVPCouplingBlock and not AllInOneBlock. The former includes two alternating coupling networks, while the latter is only single-sided.

Furthermore, AllInOneBlock applies ActNorm and PermuteRandom at the end of the coupling block, not at the beginning. We therefore need to add those manually before every RNVPCouplingBlock.

Does anyone know whether the permutation indeed needs to be fixed during training, or whether we need to use a different permutation at every training step? I'm asking because the PermuteRandom module from FrEIA is fixed during training.

Edit: Does it really matter though? I think the reason for the alternating coupling networks in RealNVP was to also train the upper half of the channels. But when permuting randomly multiple times, we also train every channel. Hmm, I'm a bit clueless here.

Edit2: ActNorm at the beginning is paramount. When you do this, all backbones work like magic. No manual scaling needed.
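
Concretely, the prepending I mean looks like this with FrEIA (a sketch; subnet_conv3 stands for whatever coupling subnet you use, and the block choice is still up for debate):

coder = Ff.SequenceINN(c, h, w)
for _ in range(n_blocks):
    coder.append(Fm.ActNorm)        # normalize activations first
    coder.append(Fm.PermuteRandom)  # fixed random channel permutation
    coder.append(Fm.GLOWCouplingBlock, subnet_constructor=subnet_conv3, clamp=1.2)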

gathierry commented 2 years ago

I think PermuteRandom is actually just a more flexible lower-half/upper-half alternation, so essentially I don't see a big difference. I was also trying to figure out whether it's a coupling block or an AllInOneBlock from the A.d. params in Table 1. But as mentioned earlier, I can never match all of them.

Based on our experiments, PermuteRandom must be fixed from initialization. Otherwise, the NF cannot learn anything useful.

maaft commented 2 years ago

Yes, I think you are right.

To match parameters: Which layers are you using from the resnet?

Per paper:

the only free variable to play with in this case is the number of mid-channels for both subnets.

Unfortunately, my GPU memory is too small to use the first 3 feature maps. Please let me know if you can achieve any good results with the above configuration.

Count Parameters with:

nf_params = sum(p.numel() for p in self.nf.parameters() if p.requires_grad) # self.nf is the flow head
mjack3 commented 2 years ago

I opened a question in the FrEIA GitHub:

https://github.com/VLL-HD/FrEIA/issues/113

gathierry commented 2 years ago

@maaft

I tried to move Permute and ActNorm from the end to the beginning of the block, as you suggested, but I didn't see significant improvement. Maybe there are some other issues in my code.

mjack3 commented 2 years ago

@maaft

  • I think the "first 3 blocks" for resnet18 means strides 4x, 8x, and 16x, so the channel numbers should be (64, 128, 256). See Table 6.

  • In fact, section 6.1 and the caption of Table 7 indicate the mid-channel numbers of the subnets.

I tried to move Permute and ActNorm from the end to the beginning of the block, as you suggested, but I didn't see significant improvement. Maybe there are some other issues in my code.

I am getting NaN when ActNorm is at the beginning of the block in a SequenceINN. Could you share an image?

maaft commented 2 years ago

I guess I could share my model later. No idea why you get nans.

Maybe your data is already bad and contains nans? Are you normalizing your images?

mjack3 commented 2 years ago

Btw, for ResNet the outputs of layer1, layer2, and layer3 are used.

Currently, I have a model that achieves [0.98, 1.0] image-level classification for every class, with 25 epochs instead of 500.

It needs some adjustments, but I hope to open the code soon for community participation.

Note: the code of this repo is wrong (sorry)

mjack3 commented 2 years ago

I guess I could share my model later. No idea why you get nans.

Maybe your data is already bad and contains nans? Are you normalizing your images?

For that, I am emulating the process with random inputs:

x = torch.rand(16, 3, 256, 256)
o = model(x)

But yes, I also tested with real images normalized in the standard PyTorch way.

AlessioGalluccio commented 2 years ago

The permutation of channels must be fixed during training. As @gathierry mentioned, it's necessary for normalizing flows.

@AlessioGalluccio just a small remark: For anomaly score calculation (global and pixelwise) you need to use p(z) and not z which you are currently using.

you can estimate logp(z) (and therefore p(z)) analogously to the PyTorch implementation of CFlow-AD.

For the anomaly score I apply anomaly_score.append(t2np(torch.mean(z_grouped_temp ** 2, dim=(-2, -1)))), as is done in DifferNet. Do you mean that I should add a /2 to it to match the negative log-likelihood of a normal distribution?

AlessioGalluccio commented 2 years ago

Looking at CFlow-AD, in utils.py he does logp = C * _GCONST_ - 0.5*torch.sum(z**2, 1) + logdet_J. He computes the positive likelihood instead of the negative one; in fact, he calculates the score, not the anomaly score. Then in train.py he computes:

# invert probs to anomaly scores
        super_mask = score_mask.max() - score_mask

In this way he gets the anomaly score. So, it's basically the same. I think that adding the Jacobian to the anomaly score is useless, since it is the same for every output: the Jacobian depends on the weights of the net, not on the input image.

maaft commented 2 years ago

I'll share my model, loss function and anomaly map generation tomorrow

gathierry commented 2 years ago

@AlessioGalluccio In CFlow there's an exponential converting logp to p as well. It's the same if there's only one feature level (such as DeiT and CaiT). But if there are 3 feature levels (resnet), it would be different, since the exp is performed before the sum of the three score maps from the three levels. logp is in (-inf, 0] but p is in [0, 1]; sum(logp) and sum(p) can result in totally different values.
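
A toy example of how the two aggregations can even flip the ranking (numbers made up):

import math

pixel_a = [-1.0, -10.0]  # logp of pixel A at two feature levels
pixel_b = [-5.0, -5.0]   # logp of pixel B at two feature levels

print(sum(pixel_a), sum(pixel_b))  # -11.0 vs -10.0: B looks more normal
print(sum(math.exp(v) for v in pixel_a),
      sum(math.exp(v) for v in pixel_b))  # ~0.368 vs ~0.013: A looks more normal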

gathierry commented 2 years ago

And for logp = C * _GCONST_ - 0.5*torch.sum(z**2, 1) + logdet_J: does it make sense to reduce only dim=1 when doing the sum on logdet_J as well? I subclassed AllInOneBlock to keep the H and W axes:

class AllInOneBlock2D(Fm.AllInOneBlock):
    def __init__(self, dims_in, **kwargs):
        super().__init__(dims_in, **kwargs)
        # reduce only the channel dim, so the log-jacobian keeps its (H, W) axes
        self.sum_dims = (1,)
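
With sum_dims = (1,), the log-Jacobian comes out per spatial location instead of as one scalar per sample. A quick sanity check (a sketch; the shapes are what I'd expect, not verified against every FrEIA version):

inn = Ff.SequenceINN(64, 32, 32)
inn.append(AllInOneBlock2D, subnet_constructor=subnet_conv3)
z, log_jac_det = inn(torch.rand(2, 64, 32, 32))
print(z.shape, log_jac_det.shape)  # (2, 64, 32, 32) and a per-pixel (2, 32, 32)
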
gathierry commented 2 years ago

@mjack3 As for ActNorm, I simply moved the _permute of AllInOneBlock to the beginning of forward and removed the original one. I don't think this is the root cause of the NaNs, but it might somehow amplify your gradient.

def forward(self, x, c=[], rev=False, jac=True):
    '''See base class docstring'''
    if self.householder:
        self.w_perm = self._construct_householder_permutation()
        if rev or self.reverse_pre_permute:
            self.w_perm_inv = self.w_perm.transpose(0, 1).contiguous()

    # ==== ActNorm ====
    x0, global_scaling_jac = self._permute(x[0], rev=False)
    # ==== ActNorm end ====
    x1, x2 = torch.split(x0, self.splits, dim=1)

    if self.conditional:
        x1c = torch.cat([x1, *c], 1)
    else:
        x1c = x1

    if not rev:
        a1 = self.subnet(x1c)
        x2, j2 = self._affine(x2, a1)
    else:
        a1 = self.subnet(x1c)
        x2, j2 = self._affine(x2, a1, rev=True)

    log_jac_det = j2
    x_out = torch.cat((x1, x2), 1)

    # add the global scaling Jacobian to the total.
    # trick to get the total number of non-channel dimensions:
    # number of elements of the first channel of the first batch member
    n_pixels = x_out[0, :1].numel()
    log_jac_det += (-1)**rev * n_pixels * global_scaling_jac

    return (x_out,), log_jac_det

mjack3 commented 2 years ago

I opened my code.

https://github.com/mjack3/FastFlow-AD

There I opened a Q&A section where we can discuss.

maaft commented 2 years ago

Here's the model I'm using. Due to memory limitations, I'm only using the layer2 and layer3 outputs of the resnet. You can uncomment feature_dims[0] and image_sizes[0] and change features = self.fe.forward(image)[2:-1] to features = self.fe.forward(image)[1:-1] to also include the layer1 features.

from typing import List, Optional, Tuple

import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

import FrEIA.framework as Ff
import FrEIA.modules as Fm
from FrEIA.framework.sequence_inn import SequenceINN

_GCONST_ = -0.9189385332046727  # -0.5 * ln(2 * pi)

def z_to_logpz(z: torch.Tensor):
    return _GCONST_ - 0.5 * torch.mean(z**2, dim=1, keepdim=True)

def get_loss(z: torch.Tensor, logdet_j: torch.Tensor) -> torch.Tensor:
    # negative log-likelihood under a unit Gaussian prior, up to an additive constant
    loss = torch.mean(
        0.5 * torch.sum(z ** 2, dim=(1, 2, 3)) - logdet_j)
    return loss

def mid_chan(dims_in: int, dims_out: int):
    # hidden width of the coupling subnets (here: equal to the input width)
    return dims_in

def subnet_conv1(dims_in: int, dims_out: int):
    """1x1 conv subnetwork to predicts the affine coefficients.

    Args:
        dims_in (int): input dimensions
        dims_out (int): output dimensions

    Returns:
        nn.Sequential: Feed-forward subnetwork
    """
    kernel_size = 1
    return nn.Sequential(
        nn.Conv2d(dims_in, mid_chan(dims_in, dims_out),
                  kernel_size=kernel_size),
        nn.ReLU(),
        nn.Conv2d(mid_chan(dims_in, dims_out),
                  dims_out, kernel_size=kernel_size)
    )

def subnet_conv3(dims_in: int, dims_out: int):
    """3x3 conv subnetwork to predicts the affine coefficients.

    Args:
        dims_in (int): input dimensions
        dims_out (int): output dimensions

    Returns:
        nn.Sequential: Feed-forward subnetwork
    """
    kernel_size = 3
    return nn.Sequential(
        nn.Conv2d(dims_in, mid_chan(dims_in, dims_out),
                  kernel_size=kernel_size, padding=1),
        nn.ReLU(),
        nn.Conv2d(mid_chan(dims_in, dims_out), dims_out,
                  kernel_size=kernel_size, padding=1)
    )

def fastflow_head(coupling_blocks: int, clamp_alpha: float, c: int, h: int, w: int) -> SequenceINN:
    """Create an invertible FastFlow head.

    Args:
        coupling_blocks (int): number of coupling blocks to build the head
        clamp_alpha (float): clamping value to avoid exploding values
        c (int): channels of the input feature map
        h (int): height of the input feature map
        w (int): width of the input feature map

    Returns:
        SequenceINN: invertible network block
    """
    coder = Ff.SequenceINN(c, h, w)
    clamp_activation = "ATAN"

    for _ in range(coupling_blocks // 2):
        coder.append(
            Fm.ActNorm
        )
        coder.append(
            Fm.PermuteRandom
        )
        coder.append(
            Fm.GLOWCouplingBlock,
            subnet_constructor=subnet_conv3,
            clamp=clamp_alpha,
            clamp_activation=clamp_activation,
        )
        coder.append(
            Fm.ActNorm
        )
        coder.append(
            Fm.PermuteRandom
        )
        coder.append(
            Fm.GLOWCouplingBlock,
            subnet_constructor=subnet_conv1,
            clamp=clamp_alpha,
            clamp_activation=clamp_activation,
        )
    return coder

class AnomalyMapGenerator(nn.Module):
    """Generate Anomaly Heatmap."""

    def __init__(self, image_size: Tuple[int, int]):
        super(AnomalyMapGenerator, self).__init__()
        self.image_size = image_size

    def forward(self, distributions: List[torch.Tensor]) -> torch.Tensor:
        score_map = torch.zeros(
            distributions[0].shape[0], 1, self.image_size[0], self.image_size[1], device=distributions[0].device)
        for dist in distributions:
            score_map += F.interpolate(torch.exp(dist - torch.max(dist.view(
                dist.shape[0], -1), dim=1)[0].view(-1, 1, 1, 1)), size=self.image_size, mode="bilinear", align_corners=True)

        score_map = score_map / len(distributions)
        # invert probability to get anomaly score
        anomaly_map = 1 - score_map

        return anomaly_map

class FastFlow(nn.Module):
    def __init__(self, image_size: Tuple[int, int]):
        super(FastFlow, self).__init__()
        self.fe: timm.models.resnet.ResNet = timm.create_model(
            "resnet18", pretrained=True, features_only=True
        )

        for param in self.fe.parameters():
            param.requires_grad = False

        # # resnet 18
        feature_dims = [
            # 64,
            128,
            256
        ]
        image_sizes = [
            # (image_size[0] // (2**2), image_size[1] // (2**2)),
            (image_size[0] // (2**3), image_size[1] // (2**3)),
            (image_size[0] // (2**4), image_size[1] // (2**4)),
        ]

        self.nf = nn.ModuleList(
            [
                fastflow_head(
                    8, 1.2, dim, image_size[0], image_size[1])
                for i, (dim, image_size) in enumerate(zip(feature_dims, image_sizes))
            ]
        )

        nf_params = sum(p.numel()
                        for p in self.nf.parameters() if p.requires_grad)
        print("Params: ", nf_params / 1000000.)

        def initialize_weights(m):
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(
                    m.weight.data, gain=nn.init.calculate_gain('relu'))
                if m.bias is not None:
                    nn.init.constant_(m.bias.data, 0)

        self.nf.apply(initialize_weights)

        self.anomaly_generator = AnomalyMapGenerator(image_size)

    def forward(self, image: torch.Tensor):
        with torch.no_grad():
            features = self.fe.forward(image)[2:-1]

        anomaly_map: Optional[torch.Tensor] = None
        distributions: List[torch.Tensor] = []
        loss = torch.zeros(1, device=image.device)
        for i, f in enumerate(features):
            z: torch.Tensor
            log_det_jac: torch.Tensor

            z, log_det_jac = self.nf[i](f.detach())
            loss += get_loss(z, log_det_jac)

            with torch.no_grad():
                logpz = z_to_logpz(z)
                distributions.append(logpz.detach())

        with torch.no_grad():
            anomaly_map = self.anomaly_generator.forward(
                distributions).detach()

        return anomaly_map, loss

optimizer:

optimizer = torch.optim.Adam(model.nf.parameters(), lr=2e-4, betas=(0.8, 0.8), eps=1e-04, weight_decay=1e-5)
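
A minimal training loop using the above (a sketch; train_loader and the device handling are my assumptions):

for images in train_loader:  # images: (B, 3, H, W), ImageNet-normalized
    optimizer.zero_grad()
    _, loss = model(images.cuda())  # the anomaly map is only needed at eval time
    loss.backward()
    optimizer.step()
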
maaft commented 2 years ago

I opened my code.

https://github.com/mjack3/FastFlow-AD

There I opened a Q&A section where we can discuss.

I'm getting a 404 - sure that it's open?

mjack3 commented 2 years ago

I emailed my company to ask them to open the code. But I can show the image-level scores that I'm getting. Are you getting something similar?

[image]

Howeng98 commented 2 years ago

@mjack3 It seems to produce good results overall. Do you have any idea why the Capsule and Toothbrush classes don't perform well with the flow model on the 3 feature-extractor layer outputs?

In the general case, Screw is hard to do well on because of its alignment and rotation problems, and we also know Transistor's anomalies are quite different, so those classes being hard is within our expectations.

mjack3 commented 2 years ago

I think it could be due to something the paper says:

It should be noted that some categories are not suitable for violent data augmentation.

Maybe I'm doing the wrong data augmentation there.

maaft commented 2 years ago

@mjack3 just turn all augmentation off. The paper reports only a sub-1% change in performance. Personally, I noticed strong degradation, prediction artifacts, and NaN issues when using (maybe "wrong") DA.

mjack3 commented 2 years ago

Thanks for your contribution @maaft.

As I said, I was getting NaN tensors using the GLOWCouplingBlock. But if I extract the features with the timm library, I stop getting NaN tensors.

Before, I was using https://pytorch.org/vision/stable/feature_extraction.html

mjack3 commented 2 years ago

@maaft what's the math logic behind the z_to_logpz method?

P.S.: I'm waiting for my company to open the code.

maaft commented 2 years ago

@mjack3 CFlow-AD, Eq. 8.

I realise that I forgot to add the logdet. In practice it doesn't make a difference, since the logdet will be positive and we subtract the maximum of logp, so p will be in [0, 1].
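
In code, that normalization is just (per score map, matching the AnomalyMapGenerator above):

# hedged: per-map normalization, so p lands in (0, 1] regardless of the missing logdet
p = torch.exp(logp - logp.view(logp.shape[0], -1).max(dim=1)[0].view(-1, 1, 1, 1))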

maaft commented 2 years ago

@mjack3 while we're waiting for your company to release your code, maybe you could comment on whether you did anything fundamentally different from my model/loss above.

mjack3 commented 2 years ago

Sure. Give me a second, I'll share a piece.

mjack3 commented 2 years ago

My model:

My FastFlow head is based on AllInOneBlock:

def fastflow_head(dims: tuple) -> Ff.SequenceINN:
    # subnet_conv_3x3 / subnet_conv_1x1: conv subnets (not shown; presumably analogous to the ones above)
    inn = Ff.SequenceINN(*dims)
    for k in range(4):
        inn.append(Fm.AllInOneBlock, subnet_constructor=subnet_conv_3x3, permute_soft=True)
        inn.append(Fm.AllInOneBlock, subnet_constructor=subnet_conv_1x1, permute_soft=True)
    return inn

The loss function I am using in training is the same as yours (based on CS-Flow, not CFLOW-AD):

loss = torch.mean(0.5 * torch.sum(z ** 2, dim=(1, 2, 3)) - log_j) / z.shape[1]

and then I am trying to build the anomaly scores directly from z, similar to what CS-Flow does. Reading that paper, we can see that the authors can build an estimated anomaly map, but because their NF is built with dense layers instead of convolutions, it's not great for anomaly localization.

Currently, reading Eq. 8 from CFLOW-AD, your piece of code makes sense, but we would be mixing different training logics, because I think the loss function in CFLOW-AD is different from what you are using (I haven't read the paper in detail yet).

maaft commented 2 years ago

Thank you very much.

Just some notes/questions:

mjack3 commented 2 years ago
maaft commented 2 years ago

Thanks for the other points! :)

mjack3 commented 2 years ago

Hmm, thanks for your point. I don't have much experience with normalizing flows, so probably you are right and I'm wrong to use AllInOneBlock.

maaft commented 2 years ago

I don't think that it matters in the end. Your performance is great.

By the way, I think RealNVP introduced the double-sided blocks only to make sure that all channels are trained; they did this before PermuteRandom became a thing. Both are equivalent IMHO. Just the number of blocks (when using FrEIA) needs to be adapted to match a certain number of conv layers.

gathierry commented 2 years ago

I think there's a conflict between the definition of "one step" in Eq. (3) and Fig. 2.

maaft commented 2 years ago

I don't think so. The figure says "flow step" while Eq. 3 only says "step" and references Dinh et al.