SamuelJoutard / Permutohedral_attention_module

Apache License 2.0

PermutohedralLattice.apply #7

Closed sbelharbi closed 3 years ago

sbelharbi commented 3 years ago

Hi, here is the second question.

I have an RGB image with height h and width w. My initial thought was to compute the affinity matrix W and then perform the other necessary matrix products. But I think the whole point of the permutohedral lattice (PL) is to efficiently compute the product W * IMAGE_OR_SOFTMAXSCORES without ever materializing W.

W is of size (h*w, h*w).

W is the matrix defined here, eq. 4 in here, and as you define it in your CRF code:

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/CRF/crf.py#L92

It also appears in the first term of eq. 3 in here, and in eq. 6 in here.

Often, we need to compute S^t * W * (1 - S), where S is the softmax scores, * is the matrix product, and ^t is the matrix transpose.

My question now is how to use your PermutohedralLattice.apply to compute either:

  1. S^t * W
  2. W * (1 - S), or simply W * Z.

I need to be able to perform both, in particular the second operation. Z has the same shape as S. This is useful in order to compare against another C++ implementation that, for technical reasons, first evaluates W * Z with Z simply set to S, and then takes the product between S^t and the result of W * Z.
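To make the target computation concrete, here is a naive dense reference that can be used for checking on tiny inputs (a sketch only; it assumes a Gaussian affinity over the concatenated features (spatial / alpha, rgb / beta) and materializes the full (h*w, h*w) matrix W, which is exactly what the lattice is meant to avoid):

    import torch

    def dense_reference(features, S):
        # features: (d, h*w) bilateral features; S: (c, h*w) softmax scores
        f = features.t()                                    # (h*w, d)
        W = torch.exp(-0.5 * torch.cdist(f, f).pow(2))      # (h*w, h*w) affinity
        WZ = W @ (1.0 - S).t()                              # W (1 - S): (h*w, c)
        St_WZ = S @ WZ                                      # S^t W (1 - S): (c, c)
        return WZ, St_WZ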

Thanks.

From your code here,

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/CRF/crf.py#L102

Here is what I did, and then I got the error reported in the other post. It is your code, but applied to dummy inputs: an RGB image, and softmax scores with a single plane filled with 1. I expect the output, i.e. norm_1, to have the same number of elements as ones:

    # Imports needed to run this snippet
    import numpy as np
    import torch

    from PAM_cuda.pl import PermutohedralLattice

    np.random.seed(0)
    n, c, h, w = 32, 3, 224, 225

    img = np.random.rand(n, c, h, w) * 255
    img = torch.cuda.FloatTensor(img)
    img = torch.clip(img, 0, 255)

    npx = h * w
    spatial_x, spatial_y = torch.meshgrid(
        torch.arange(h).cuda(),
        torch.arange(w).cuda()
    )
    spatial = torch.stack([spatial_x, spatial_y], dim=0)  # 3d tensor (2, h, w)
    # Duplicate the coordinates along the batch dimension
    spatial = spatial.unsqueeze(0).repeat(n, 1, 1, 1)  # 4d tensor (n, 2, h, w)
    spatial = spatial.type(torch.cuda.FloatTensor).detach()
    spatial = torch.reshape(spatial, (n, spatial.size(1), -1))
    # Create the bilateral kernel features
    # Features for the first term of eq (3) in [1]
    img_fea = torch.reshape(img, (n, img.size(1), -1))
    _alpha = 1
    _beta = 1
    features_1 = torch.cat([spatial / _alpha, img_fea / _beta], dim=1)
    ones = torch.ones((n, 1, npx)).cuda()
    pl = PermutohedralLattice.apply
    norm_1 = pl(features_1, ones)

Thanks for your help.

SamuelJoutard commented 3 years ago

Hi Soufiane,

Thank you for your interest in our work. I think I can answer this issue more easily than the other one (which seems to have been addressed now thanks to ptrblck). To compute your product S^t W (1 - S), I would first compute W (1 - S), using the lattice if you need the speed-up. That means, using the names from my code, that the descriptors (desc) are (1 - S), while the features are whatever you plan to use to compute W (typically RGB + spatial coordinates). You then obtain something with the same shape as S, so you can, as you suggested, take the product between S^t and the output of that operation.

I hope this is clear enough; if not, we can discuss it further through a code example, e.g. something like the sketch below.
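As a rough sketch of the two steps (assuming the pl(features, desc) call from the snippet above, with segmentations standing for the softmax scores S of shape (n, c, h*w)):

    import torch
    from PAM_cuda.pl import PermutohedralLattice

    pl = PermutohedralLattice.apply
    # Step 1: approximate W (1 - S) with the lattice: desc = (1 - S),
    # features_1 = the features that define W (spatial + RGB).
    W_one_minus_S = pl(features_1, 1.0 - segmentations)  # shape (n, c, h*w)
    # Step 2: the remaining product with S^t is a plain batched matmul.
    result = torch.bmm(segmentations, W_one_minus_S.transpose(1, 2))  # (n, c, c)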

Samuel

sbelharbi commented 3 years ago

Hi Samuel, thank you very much for the details. It is clear.

So, the aim of using your implementation is to replace the multi-threaded C++ CPU implementation of the bilateral filter in here. The issue with the CPU implementation is that the C++ multi-threading seems to be turned off when using DDP + multiple GPUs, so the gain from multiple GPUs is lost with the loss of multi-threading. (Let me know if you have any suggestion about this. Thanks -- update: this multi-threading issue has been solved by setting OMP_NUM_THREADS. Thanks.)
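For reference, one way to apply that fix from inside the entry script (a sketch; the value 8 is just illustrative, and the variable must be set before the OpenMP-based code is first loaded):

    import os
    # Must be set before torch (and the C++ extension) initialize OpenMP.
    os.environ.setdefault("OMP_NUM_THREADS", "8")  # illustrative value

    import torch
    torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))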

I did a first comparison between your implementation and theirs, where they use at least one thread per sample in the mini-batch. Let's ignore the difference in terms of values for now.

I measured the time needed to perform W Z when using DDP with a single GPU (DDP has no impact in this case), using your code and theirs on the same input (RGB image (224, 224) + segmentation (2 planes), batch size 32):

  1. multi-threaded C++ CPU implementation: 65 ms
  2. your GPU implementation: 300 ms

GPU: Tesla P100.

Note: when using DDP + 2 GPUs, the C++ method slows down to 600 ms because its multi-threading is not working. So the good news is that your GPU implementation is helpful in the multi-GPU case, but it is still very slow.

Note: when computing W * Z, I don't need the gradient since I compute it manually later.

Q1: did you optimize your code in terms of speed? Q2: if not, do you see any opportunity to speed up any part of it? Note that the computations needed for the gradient are not necessary here.

I'll time the instructions in your code and try to find the bottlenecks.

I really expect that the GPU version can easily run in <= 65 ms. Please let me know if you have any idea how to speed up the code given the above information.

Thank you very much for your help!

The code used is similar to this:


import torch
import numpy as np

from PAM_cuda.pl import PermutohedralLattice

def set_seed(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def main():
    seed = 0
    cuda = "0"
    print("cuda:{}".format(cuda))
    DEVICE = torch.device(
        "cuda:{}".format(cuda) if torch.cuda.is_available() else "cpu")

    set_seed(seed=seed)
    torch.backends.cudnn.benchmark = True
    # Deterministic mode can have a performance impact, depending on your model.
    torch.backends.cudnn.deterministic = True

    n, c, h, w = 32, 3, 224, 224
    nbr_cl = 2

    img = torch.randint(
        low=0, high=256,
        size=(n, 3, h, w), dtype=torch.float, device=DEVICE,
        requires_grad=False)
    segmentations = torch.rand(size=(n, nbr_cl, h, w), dtype=torch.float,
                               device=DEVICE,
                               requires_grad=True)
    segmentations = torch.softmax(segmentations, dim=1)
    segmentations = segmentations.view(n, nbr_cl, -1)

    npx = h * w
    spatial_x, spatial_y = torch.meshgrid(
        torch.arange(h, device=DEVICE),
        torch.arange(w, device=DEVICE)
    )
    spatial = torch.stack([spatial_x, spatial_y], dim=0)  # 3d tensor (2, h, w)
    # Duplicate the coordinates along the batch dimension
    spatial = spatial.unsqueeze(0).repeat(n, 1, 1, 1)  # 4d tensor (n, 2, h, w)
    spatial = spatial.float().detach()
    spatial = torch.reshape(spatial, (n, spatial.size(1), -1))
    # Create the bilateral kernel features
    # Features for the first term of eq (3) in [1]
    img_fea = torch.reshape(img, (n, img.size(1), -1))
    _alpha = 100.
    _beta = 15.
    features_1 = torch.cat([spatial / _alpha, img_fea / _beta], dim=1)

    pl = PermutohedralLattice.apply
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    with torch.no_grad():
        norm_1 = pl(features_1, segmentations)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('time op: {}'.format(elapsed_time_ms))

if __name__ == '__main__':
    for i in range(10):
        print('run {}'.format(i))
        main()

sbelharbi commented 3 years ago

So, I timed the forward pl(features_1, segmentations) above, which takes around 320 ms:

  1. PermutohedralLattice.prepare took 309 ms: https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L20

  2. PermutohedralLattice.permutohedral_compute took 11 ms: https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L21

So PermutohedralLattice.prepare dominates the runtime.
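For the per-stage numbers below, a small helper with the same CUDA-event pattern as the script above can be used (a sketch; each stage is wrapped in a callable, e.g. a lambda around the corresponding lines of prepare):

    import torch

    def time_cuda_ms(fn):
        # Time a callable on the GPU with CUDA events; returns (output, milliseconds).
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn()
        end.record()
        torch.cuda.synchronize()
        return out, start.elapsed_time(end)

For example: out, ms = time_cuda_ms(lambda: pl(features_1, segmentations)).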

I timed the large parts of PermutohedralLattice.prepare:

A. Init (the issue is here): from

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L63

to https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L109

time: 283.8191223144531 ms

B. loop1: from

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L110

to

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L118

time: 21.377792358398438 ms

C. HT_opp.get_values: 0.9110080003738403 ms

D. loop2: from

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L121

to

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L135

time: 3.7452800273895264 ms

E. Ending: from

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L136

to

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L144

time: 0.9512959718704224 ms


This operation, which creates a huge zero matrix, takes 244 ms!

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L105

Also, there are many numpy-to-PyTorch conversions and CPU-to-GPU transfers that could be reduced.

By creating the table directly on the GPU, the time of that instruction drops to 2.5 ms:

device = hash_vector.device
table = torch.zeros((table_size, n_ch_1), device=device, dtype=torch.int32) - 2

Even casting a huge tensor can take 2 ms.

Now the total forward pl(features_1, segmentations) takes 82 ms (down from 320 ms). It is too bad we can't instantiate the torch.autograd.Function, since some things, such as the table, are constant across calls. But we can still store them outside and pass them to the forward call as arguments.
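A hypothetical sketch of that idea (the buffer class and the extended forward arguments are assumptions about how pl.py could be refactored, not the current API; table_size and n_ch_1 are the names from the snippet above):

    import torch

    class LatticeBuffers:
        # Constant GPU buffers allocated once and reused across forward calls.
        def __init__(self, table_size, n_ch_1, device):
            self.table = torch.zeros((table_size, n_ch_1),
                                     device=device, dtype=torch.int32) - 2

        def reset_(self):
            # Cheap in-place reset instead of re-allocating every call.
            self.table.fill_(-2)

    # Hypothetical usage, assuming forward() were extended to accept the table:
    # buffers = LatticeBuffers(table_size, n_ch_1, features_1.device)
    # buffers.reset_()
    # out = PermutohedralLattice.apply(features_1, segmentations, buffers.table)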

I think I can drop the call time below 70 ms. I'll try and let you know.

Thank you again!

sbelharbi commented 3 years ago

I tried to see if there were more corners to cut: nothing significant. Here is the gist of the bottlenecks in pl(features_1, segmentations), which takes ~82 ms in total:

I tried to replace all the numpy ops with torch ops, but it didn't bring much benefit since those tensors are small.

I just made a quick PR with a minor modification to

https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L105

by:

device = hash_vector.device
table = torch.zeros((table_size, n_ch_1), device=device, dtype=torch.int32) - 2

Thank you for your quick responsiveness and your help! Closing.

Best