facebookresearch / DeltaCNN

DeltaCNN End-to-End CNN Inference of Sparse Frame Differences in Videos

Issues when adding layers not included in DeltaCNN #6

Closed kai981 closed 1 year ago

kai981 commented 1 year ago

Hi, I am trying to add the layer BatchNormAct2d, as shown below, into a DeltaCNN model:

# imports assumed by this snippet (timm wraps torch._assert in a small helper):
import torch.nn as nn
import torch.nn.functional as F
from torch import _assert

class BatchNormAct2d(nn.BatchNorm2d):
    def __init__(
            self,
            num_features,
            eps=1e-5,
            momentum=0.1,
            affine=True,
            track_running_stats=True,
            apply_act=True,
            act_layer=nn.ReLU,
            inplace=True,
            drop_layer=None,
            device=None,
            dtype=None
    ):
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(BatchNormAct2d, self).__init__(
            num_features, eps=eps, momentum=momentum, affine=affine, track_running_stats=track_running_stats,
            **factory_kwargs
        )

        self.drop = nn.Identity()
        if act_layer is not None and apply_act:
            act_args = dict(inplace=True) if inplace else {}
            self.act = act_layer(**act_args)
        else:
            self.act = nn.Identity()

    def forward(self, x):
        _assert(x.ndim == 4, f'expected 4D input (got {x.ndim}D input)')

        if self.momentum is None:
            exponential_average_factor = 0.0
        else:
            exponential_average_factor = self.momentum

        if self.training and self.track_running_stats:
            if self.num_batches_tracked is not None:  
                self.num_batches_tracked = self.num_batches_tracked + 1  
                if self.momentum is None:  
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else: 
                    exponential_average_factor = self.momentum

        if self.training:
            bn_training = True
        else:
            bn_training = (self.running_mean is None) and (self.running_var is None)

        x = F.batch_norm(
            x,
            self.running_mean if not self.training or self.track_running_stats else None,
            self.running_var if not self.training or self.track_running_stats else None,
            self.weight,
            self.bias,
            bn_training,
            exponential_average_factor,
            self.eps,
        )
        x = self.drop(x)
        x = self.act(x)
        return x

My attempt at modifying it is shown below:

# in addition to the torch imports above, this snippet assumes (import path
# inferred from the deltacnn package layout in the traceback):
from deltacnn.sparse_layers import (
    DCModule, DCBatchNorm2d, DCDensify, DCSparsify, DCActivation,
)

class DCIdentity(DCModule):
    def __init__(self) -> None:
        super(DCIdentity, self).__init__()

    def forward(self, x):
        return x

class DCBatchNormAct2d(DCBatchNorm2d): 
    def __init__(
            self,
            num_features,
            eps=1e-5,
            momentum=0.1,
            affine=True,
            track_running_stats=True,
            apply_act=True,
            act_layer="relu",
            inplace=True,
            drop_layer=None,
            device=None,
            dtype=None
    ):
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(DCBatchNormAct2d, self).__init__(
            num_features, eps=eps, momentum=momentum, affine=affine, track_running_stats=track_running_stats,
            **factory_kwargs
        )
        self.densify = DCDensify()  
        self.drop = DCIdentity()  
        if act_layer is not None and apply_act:
            self.act = DCActivation(act_layer, inplace)
        else:
            self.act = DCIdentity() 
        self.sparsify = DCSparsify() 

    def forward(self, x):
        x = self.densify(x)  
        _assert(x.ndim == 4, f'expected 4D input (got {x.ndim}D input)')

        if self.momentum is None:
            exponential_average_factor = 0.0
        else:
            exponential_average_factor = self.momentum

        bn_training = (self.running_mean is None) and (self.running_var is None)

        x = F.batch_norm(  
            x,
            self.running_mean if not self.training or self.track_running_stats else None,
            self.running_var if not self.training or self.track_running_stats else None,
            self.weight,
            self.bias,
            bn_training,
            exponential_average_factor,
            self.eps,
        )
        x = self.sparsify(x)  
        x = self.drop(x)
        x = self.act(x)
        return x  

However, I have encountered the following CUDA error (the class DCBatchNormAct2d is in norm_act_dc.py):

Traceback (most recent call last):
  File "/home/efficientdet/validate_deltacnn.py", line 251, in <module>
    main()
  File "/home/efficientdet/validate_deltacnn.py", line 247, in main
    validate(args)
  File "/home/efficientdet/validate_deltacnn.py", line 217, in validate
    output_dc = dc_model(input, img_info=target)  
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/effdet/bench_deltacnn.py", line 104, in forward
    class_out, box_out = self.model(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/effdet/efficientdet_deltacnn.py", line 577, in forward
    x = self.backbone(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/efficientnet_deltacnn.py", line 240, in forward
    x = b(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/efficientnet_blocks_dc.py", line 148, in forward
    x = self.bn2(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/layers/norm_act_dc.py", line 72, in forward
    x = self.densify(x)  
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/src/deltacnn/sparse_layers.py", line 1308, in forward
    self.prev_out = torch.zeros_like(input)
RuntimeError: CUDA error: an illegal memory access was encountered

Could you advise whether my modification of the layer is correct? Thank you so much!

dabeschte commented 1 year ago

Hi! Great to see that you are using our code in your project. At first sight, everything seems legit. I suspect that the error is actually happening earlier and is simply not detected in time. Can you try running the code with the environment variable CUDA_LAUNCH_BLOCKING=1? This should help find the error where it actually happens.

Another hint: this layer is just a batchnorm and an activation combined. You could use DeltaCNN's BatchNorm2d implementation here to avoid the costly conversions between sparse and dense. Our implementation even simplifies the operation to a single multiply-add applied only to active pixels.
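
For reference, a minimal sketch of enabling blocking launches (the script name is taken from your traceback); the variable must be set before torch initializes its CUDA context:

# Option 1: from the shell, before starting Python:
#   CUDA_LAUNCH_BLOCKING=1 python validate_deltacnn.py
# Option 2: at the very top of the entry script, before importing torch:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # makes all kernel launches synchronous

import torch  # import torch only after the variable is set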

kai981 commented 1 year ago

Hi, thanks for the suggestion. I have changed the layer to use DeltaCNN's BatchNorm2d. However, I am still encountering the same CUDA error.

Traceback (most recent call last):
  File "/home/efficientdet/validate_deltacnn.py", line 256, in <module>
    main()
  File "/home/efficientdet/validate_deltacnn.py", line 252, in main
    validate(args)
  File "/home/efficientdet/validate_deltacnn.py", line 222, in validate
    output_dc = dc_model(input, img_info=target)  
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/effdet/bench_deltacnn.py", line 104, in forward
    class_out, box_out = self.model(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/effdet/efficientdet_deltacnn.py", line 577, in forward
    x = self.backbone(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/efficientnet_deltacnn.py", line 243, in forward
    x = b(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/efficientnet_blocks_dc.py", line 151, in forward
    x = self.bn2(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/timm/models/layers/norm_act_dc.py", line 55, in forward
    x = self.bn(x)
  File "/home/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/efficientdet/src/deltacnn/sparse_layers.py", line 1557, in forward
    self.convert_to_scale_offset(input)
  File "/home/efficientdet/src/deltacnn/sparse_layers.py", line 1535, in convert_to_scale_offset
    bn_scale = self.weight * torch.rsqrt(self.running_var + self.eps)
RuntimeError: CUDA error: an illegal memory access was encountered

Even after setting the environment variable by running CUDA_LAUNCH_BLOCKING=1 python validate_deltacnn.py, the exact same traceback is produced. Does this mean I am not using the environment variable correctly?

The modified layer is as shown:

# (uses DCIdentity and the deltacnn imports from the snippets above)
class DCBatchNormAct2d(DCModule):
    def __init__(
            self,
            num_features,
            eps=1e-5,
            momentum=0.1,
            affine=True,
            track_running_stats=True,
            apply_act=True,
            act_layer="relu",
            inplace=True,
            drop_layer=None
    ):
        super(DCBatchNormAct2d, self).__init__()
        self.bn = DCBatchNorm2d(num_features=num_features, eps=eps, momentum=momentum, affine=affine,
                                track_running_stats=track_running_stats)
        self.drop = drop_layer() if drop_layer is not None else DCIdentity()  
        if act_layer is not None and apply_act:
            self.act = DCActivation(act_layer, inplace)
        else:
            self.act = DCIdentity() 

    def forward(self, x):
        x = self.bn(x)
        x = self.drop(x)
        x = self.act(x)
        return x

dabeschte commented 1 year ago

The new, modified layer looks great. But I still think the error is happening at some earlier point; I had hoped that the environment variable would help us identify where the problem lies. How far down the model is this layer called? I assume it is, e.g., the second layer called in the model, with a convolution as the first layer? The most likely problem is that you did not convert the parameters (process_filters, according to https://github.com/facebookresearch/DeltaCNN#33-weights-and-features-memory-layout).
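
As a minimal sketch of that conversion step, based on the linked README section (the model and device names here are placeholders):

# after building the DeltaCNN model and loading the trained weights:
dc_model = dc_model.to("cuda", memory_format=torch.channels_last)  # DeltaCNN expects channels-last tensors
dc_model.process_filters()  # convert all DCConv2d filters to DeltaCNN's memory layout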

Another way to look for errors is to call torch.cuda.synchronize() at various points before the line where this error appeared. I suspect you might have the problem in the very first layer already.
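
For example, a sketch of bisecting the failure with explicit synchronization points (the layer names here are placeholders):

def forward(self, x):
    x = self.conv_stem(x)
    torch.cuda.synchronize()  # an illegal access in the stem raises here...
    x = self.bn1(x)
    torch.cuda.synchronize()  # ...instead of surfacing asynchronously at a later, misleading line
    return x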

kai981 commented 1 year ago

Hi, I would like to give an update on what I have attempted, along with some new questions. I tried calling torch.cuda.synchronize() at various points, and the error seems to occur at the squeeze-and-excite layer. I suspect the reason for the error might be that I am using too many dense-to-sparse and sparse-to-dense conversions, so I tried to make some modifications. Instead of doing sparse-to-dense, then x.mean((2, 3), keepdim=True), then dense-to-sparse, I replaced it with DCAdaptiveAveragePooling(1). However, for the last part of the layer, which scales the feature map x (self.gate is a sigmoid function), I am not sure how to get rid of the sparse-to-dense conversions. Must the multiplication be performed using dense operations?

class DCSqueezeExcite(DCModule):

    def __init__(
            self, in_chs, rd_ratio=0.25, rd_channels=None, act_layer="relu",
            gate_layer="sigmoid", force_act_layer=None, rd_round_fn=None):
        super(DCSqueezeExcite, self).__init__()
        if rd_channels is None:
            rd_round_fn = rd_round_fn or round
            rd_channels = rd_round_fn(in_chs * rd_ratio)
        act_layer = force_act_layer or act_layer
        self.densify = DCDensify()  ## added 
        self.gap = DCAdaptiveAveragePooling(1)
        #self.conv_reduce = nn.Conv2d(in_chs, rd_channels, 1, bias=True) replaced by 
        self.conv_reduce = DCConv2d(in_chs, rd_channels, 1, bias=True)
        #self.act1 = create_act_layer(act_layer, inplace=True) replaced by
        self.act1 = DCActivation(activation=act_layer, inplace=True)
        #self.conv_expand = nn.Conv2d(rd_channels, in_chs, 1, bias=True) replaced by 
        self.conv_expand = DCConv2d(rd_channels, in_chs, 1, bias=True)
        self.gate = create_act_layer(gate_layer)  # kept dense: the gate's output multiplies a dense tensor below
        # self.gate = DCActivation(activation=gate_layer)
        self.sparsify = DCSparsify()  ## added 

    def forward(self, x):
        #x_se = x.mean((2, 3), keepdim=True)  ## acts as global average pooling replaced by 
        x_se = self.gap(x)
        x_se = self.conv_reduce(x_se)
        x_se = self.act1(x_se)
        x_se = self.conv_expand(x_se)
        x_se = self.densify(x_se)
        x = self.densify(x)
        return self.sparsify(x * self.gate(x_se))

Thank you so much for your prompt replies!! I really do appreciate it :)

dabeschte commented 1 year ago

Ah, I see the problem: you are using the same densify object on two different tensors. I probably did not document this well enough, but every sparsify, densify, maxpool, and activation layer needs its own instance, because it accumulates incoming updates at its stage of the network. Densify clones the input the first time it is called and then adds all incoming updates onto this input in subsequent frames. That means that when you use it at different stages, it will not contain the correct state. Luckily, the shapes did not match here, so you got an exception; in other cases, you might not even have noticed the error, but the result would silently be wrong. So, to solve this issue: use two different densify objects for the two different values.
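
A minimal sketch of that fix (though see the better option below):

# one DCDensify per tensor: each instance accumulates its own dense state
self.densify_se = DCDensify()  # used only for x_se
self.densify_x = DCDensify()   # used only for x
...
x_se = self.densify_se(x_se)
x = self.densify_x(x)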

And if you want to save a tiny bit of performance, I'd suggest not densifying and sparsifying x at all. You only need to densify x_se in this case. The multiplication with the other tensor is a linear operation and can be applied directly to all values (active or not).

class DCSqueezeExcite(DCModule):

    def __init__(
            self, in_chs, rd_ratio=0.25, rd_channels=None, act_layer="relu",
            gate_layer="sigmoid", force_act_layer=None, rd_round_fn=None):
        super(DCSqueezeExcite, self).__init__()
        if rd_channels is None:
            rd_round_fn = rd_round_fn or round
            rd_channels = rd_round_fn(in_chs * rd_ratio)
        act_layer = force_act_layer or act_layer
        self.gap = DCAdaptiveAveragePooling(1)
        #self.conv_reduce = nn.Conv2d(in_chs, rd_channels, 1, bias=True) replaced by 
        self.conv_reduce = DCConv2d(in_chs, rd_channels, 1, bias=True)
        #self.act1 = create_act_layer(act_layer, inplace=True) replaced by
        self.act1 = DCActivation(activation=act_layer, inplace=True)
        #self.conv_expand = nn.Conv2d(rd_channels, in_chs, 1, bias=True) replaced by 
        self.conv_expand = DCConv2d(rd_channels, in_chs, 1, bias=True, activation=gate_layer, dense_out=True)

    def forward(self, x):
        #x_se = x.mean((2, 3), keepdim=True)  ## acts as global average pooling replaced by 
        x_se = self.gap(x)
        x_se = self.conv_reduce(x_se)
        x_se = self.act1(x_se)
        x_se = self.conv_expand(x_se)
        x[0].mul_(x_se)  # x is a (values, update mask) pair; scale the value tensor in place
        return x

I hope this works for you. Please keep me updated :)

kai981 commented 1 year ago

Hi, I have some new questions again :)

I have managed to run without errors now, though I have not yet checked whether the outputs are correct. To clarify: any layer with nonlinear functions (like max pooling and activations), as well as densify and sparsify layers, requires its own instance, while other layers can refer to the same instance because they do not need to store accumulated values? And for custom layers containing several inner layers, for instance the SqueezeExcite layer (with some convolution, pooling, and activation layers inside it), do I have to create a new instance of each inner layer as well?

The paper shows results for various batch sizes; however, from my understanding, DeltaCNN requires the accumulated values from the previous frame to process the current frame (for every frame after the first). How does it work for batch sizes > 1?

For the AP scores on MOT16, are they based on detecting just pedestrians (a single category)?

Thank you!

dabeschte commented 1 year ago

Good to hear.

Yes, that is correct. All non-linear layers need their own unique instance, while linear layers can be reused (though that rarely happens). However, I think EfficientDet does reuse some convolutions multiple times, and that is not an issue.

Yes, if you want to use your SqueezeExcite layer multiple times, you need multiple instances of the entire layer.

There are multiple ways to increase the batch size: you can either process multiple videos at once or split a video into sequences of e.g. 100 frames each and process all sequences in parallel. The easiest way to increase the batch size, however, is to simply repeat the current input X times; in other words, the same video is processed X times simultaneously. This is what I used in my evaluation, because the other approaches often result in different sequence lengths and are more complex to parallelize correctly.
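
A sketch of that evaluation trick (X, frame, and dc_model are placeholders):

# emulate batch size X by processing the same video X times in parallel
batch = frame.repeat(X, 1, 1, 1)  # frame: [1, C, H, W] -> batch: [X, C, H, W]
batch = batch.contiguous(memory_format=torch.channels_last)
output = dc_model(batch)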

The AP scores include all classes. However, I can't remember whether I found a good way to handle the "reflection" class; I think I just lowered the weight of that class and of all occluder classes. In any case, you can train it with whatever approach you think best fits your use case and then use the same weights for DeltaCNN.