Hi Soufiane,
thank you for your interest in our work. I think I can answer this issue more easily than the other one (which seems to have been addressed now thanks to ptrblck). In order to compute your product S^t W (1 - S), I would compute W (1 - S) first, using the lattice if you need speed-ups. That means, using the names from my code, that the descriptors (desc) are (1 - S), while the features are what you plan to use to compute W (typically rgb + spatial coordinates). You will then obtain something that has the same shape as S, so you can, as you suggested, compute the product between S and the output of the previous operation.
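To make this concrete, here is a minimal sketch of what I mean (dummy inputs and hypothetical shapes, not code taken from the repository):

```python
import torch
from PAM_cuda.pl import PermutohedralLattice

pl = PermutohedralLattice.apply
n, nbr_cl, h, w = 2, 2, 32, 32               # hypothetical sizes
device = torch.device('cuda:0')

# dummy softmax scores S (n, nbr_cl, h*w) and bilateral features (n, 5, h*w)
S = torch.softmax(torch.rand(n, nbr_cl, h * w, device=device), dim=1)
rgb = torch.rand(n, 3, h * w, device=device)
yy, xx = torch.meshgrid(torch.arange(h, device=device),
                        torch.arange(w, device=device))
xy = torch.stack([yy, xx], dim=0).float().view(1, 2, -1).repeat(n, 1, 1)
features = torch.cat([xy / 100., rgb / 15.], dim=1)

# 1) W (1 - S): filter the descriptors (1 - S) through the lattice
filtered = pl(features, 1.0 - S)             # same shape as S

# 2) S^t W (1 - S): either the full (nbr_cl x nbr_cl) matrix per sample ...
full = torch.bmm(S, filtered.transpose(1, 2))
# ... or only the scalar sum over classes and pixels
scalar = (S * filtered).sum()
```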
I hope this was clear enough; if it is not the case, we can discuss this further through a code example.
Samuel
hi Samuel, thank you very much for the details. it is clear.
so, the aim of using your implementation is to replace the c++ multi-threaded cpu implementation of the bilateral filter in here.
the issue with the cpu implementation is that the c++ multi-threading seems to be turned off when using ddp + multiple gpus, so the gain from multiple gpus is lost with the loss of multi-threading. (let me know if you have any suggestion about this. thanks -- this multi-threading issue has been solved by setting OMP_NUM_THREADS. thanks)
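for reference, a minimal way to set it from python (the value 8 is just an example; ideally it should match the cpu cores available per process):

```python
import os

# set before importing torch so the OpenMP runtime picks it up
os.environ["OMP_NUM_THREADS"] = "8"   # example value

import torch
print(torch.get_num_threads())
```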
i did a first comparison between your implementation and theirs, where they use at least one thread per sample in the minibatch. let's ignore for now the difference in terms of values.
i measured the time used to perform W Z when using ddp with a single gpu (ddp has no impact in this case), using your code and theirs on the same input (rgb image (224x224) + segmentation (2 planes), batch size 32):
gpu: tesla p100.
note: when using ddp + 2 gpus, the c++ method slows down to 600ms because its multi-threading is not working. so, the good news is that your gpu implementation is helpful in the multi-gpu case. but it is still very slow.
note: when computing W * Z, i don't need the gradient since i compute it manually later.
q1. did you optimize your code in terms of speed? q2. if not, do you see any opportunity to speed up any part of it? note that the computations needed for the gradient are not necessary here.
i'll time the instructions in your code and try to find any bottlenecks.
i really expect that the gpu version can easily run in <= 65ms. please let me know if you have any idea how to speed up the code given the above information.
thank you very much for your help!
the code used is similar to this:
```python
import sys
from os.path import dirname, abspath
import re

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from PAM_cuda.pl import PermutohedralLattice


def set_seed(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def main():
    seed = 0
    cuda = "0"
    print("cuda:{}".format(cuda))
    DEVICE = torch.device(
        "cuda:{}".format(cuda) if torch.cuda.is_available() else "cpu")
    set_seed(seed=seed)
    torch.backends.cudnn.benchmark = True
    # Deterministic mode can have a performance impact, depending on your model.
    torch.backends.cudnn.deterministic = True

    n, c, h, w = 32, 3, 224, 224
    nbr_cl = 2
    img = torch.randint(
        low=0, high=256,
        size=(n, c, h, w), dtype=torch.float, device=DEVICE,
        requires_grad=False)
    segmentations = torch.rand(size=(n, nbr_cl, h, w), dtype=torch.float,
                               device=DEVICE,
                               requires_grad=True)
    segmentations = torch.softmax(segmentations, dim=1)
    segmentations = segmentations.view(n, nbr_cl, -1)

    npx = h * w
    spatial_x, spatial_y = torch.meshgrid(
        torch.arange(h, device=DEVICE),
        torch.arange(w, device=DEVICE)
    )
    spatial = torch.stack([spatial_x, spatial_y], dim=0)  # (2, h, w)
    # Duplicate the coordinates along the batch dimension
    spatial = spatial.unsqueeze(0).repeat(n, 1, 1, 1)  # (n, 2, h, w)
    spatial = spatial.float().detach()
    spatial = torch.reshape(spatial, (n, spatial.size(1), -1))

    # Create the bilateral kernel features
    # Features for the first term of eq (3) in [1]
    img_fea = torch.reshape(img, (n, img.size(1), -1))
    _alpha = 100.
    _beta = 15.
    features_1 = torch.cat([spatial / _alpha, img_fea / _beta], dim=1)

    pl = PermutohedralLattice.apply
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    with torch.no_grad():
        norm_1 = pl(features_1, segmentations)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('time op: {}'.format(elapsed_time_ms))


if __name__ == '__main__':
    for i in range(10):
        print('run {}'.format(i))
        main()
```
so, i timed the forward pl(features_1, segmentations) above, which takes around 320ms:

PermutohedralLattice.prepare took: 309ms
https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L20

PermutohedralLattice.permutohedral_compute took: 11ms
https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/PAM_cuda/pl.py#L21

so PermutohedralLattice.prepare takes the cake.
i timed large parts of PermutohedralLattice.prepare:
A. init (the issue is here): 283.82 ms
B. loop1: 21.38 ms
C. HT_opp.get_values: 0.91 ms
D. loop2: 3.75 ms
E. ending: 0.95 ms
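sections like these can be timed with cuda events, e.g. with a small helper like this (a generic sketch, not the exact code i used):

```python
import torch


class CudaTimer:
    """context manager that times a gpu section with cuda events."""
    def __init__(self, name):
        self.name = name
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        self.start.record()
        return self

    def __exit__(self, *exc):
        self.end.record()
        torch.cuda.synchronize()
        ms = self.start.elapsed_time(self.end)
        print('{}: {:.3f} ms'.format(self.name, ms))


# usage inside PermutohedralLattice.prepare, e.g.:
# with CudaTimer('A. init'):
#     ...  # section A
```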
the operation in A that creates a huge zero matrix takes 244ms!!!!
also, there are many numpy-to-pytorch conversions + cpu-to-gpu transfers that could be reduced.
by creating the table matrix directly on the gpu, the time of that instruction drops to 2.5ms:

```python
device = hash_vector.device
table = torch.zeros((table_size, n_ch_1), device=device, dtype=torch.int32) - 2
```

even casting a huge tensor can take 2ms.
now the total forward pl(features_1, segmentations) is 82ms (down from 320ms).
it is too bad we can't instantiate torch.autograd.Function, because some things are constant, such as the table. but we can still store them outside and provide them in the forward call as arguments.
i think i can drop the call time below 70ms. i'll try and let you know.
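a minimal sketch of what i mean, with a hypothetical function and table size (not the repository's actual signature):

```python
import torch


class PLWithExternalTable(torch.autograd.Function):
    """hypothetical variant: the hash table is allocated once outside
    and reused, instead of being rebuilt at every forward call."""

    @staticmethod
    def forward(ctx, features, desc, table):
        table.fill_(-2)   # reset the pre-allocated buffer in place
        # ... build the lattice / splat / blur / slice using `table` ...
        return desc       # placeholder: the real forward returns the filtered desc


# allocate the constant buffer once and reuse it across iterations
device = torch.device('cuda:0')
table_size, n_ch_1 = 2 ** 20, 6   # hypothetical sizes
table = torch.full((table_size, n_ch_1), -2, device=device, dtype=torch.int32)
# out = PLWithExternalTable.apply(features_1, segmentations, table)
```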
thank you again!
i tried to see if there are more corners to cut: nothing.
here is the gist of the bottlenecks in pl(features_1, segmentations), which now takes ~82ms in total:
i tried to replace all numpy ops by torch ops, but it didn't bring much benefit since the tensors are small.
i just did a quick pr with a minor modification in the table creation, by:

```python
device = hash_vector.device
table = torch.zeros((table_size, n_ch_1), device=device, dtype=torch.int32) - 2
```
thank you for your quick responsiveness and your help! closing.
best
hi, here is the second question.
i have an rgb image with height h and width w. my initial thought was to compute the affinity matrix W and then perform the other necessary matrix products. but i think the whole point of pl is to efficiently compute the product W * IMAGE_OR_SOFTMAXSCORES.
W is of size (h*w, h*w).
W is the matrix defined here, in eq.4 in here, as you defined it in your code
https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/CRF/crf.py#L92
for the crf, in the first term of eq.3 in here, and in eq.6 in here.
often, we need to compute S^t W (1 - S), where S is the softmax scores, the products are matrix products, and t denotes the matrix transpose.
my question now is how to use your PermutohedralLattice.apply to compute either S^t W (1 - S) directly, or W Z for a generic Z with the same shape as S. i need to be able to perform both, in particular the second operation. it is useful in order to compare against another c++ implementation that, for a technical reason, first evaluates W Z where Z is simply S, and then performs the product between S^t and the result of W Z.
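for a tiny image, what i have in mind can be written explicitly (a naive reference only; the gaussian kernel below over spatial/rgb features is my assumption, up to the exact scaling the lattice applies):

```python
import torch

n_px, nbr_cl, d = 16 * 16, 2, 5          # tiny 16x16 image, hypothetical sizes
f = torch.rand(n_px, d)                  # bilateral features (spatial/alpha, rgb/beta)
S = torch.softmax(torch.rand(n_px, nbr_cl), dim=1)

# explicit affinity matrix (only feasible for tiny images: W is (h*w, h*w))
W = torch.exp(-torch.cdist(f, f) ** 2 / 2.)

# the two products of interest
WZ = W @ (1. - S)                        # W Z with Z = (1 - S), shape (n_px, nbr_cl)
StW1mS = S.t() @ WZ                      # S^t W (1 - S), shape (nbr_cl, nbr_cl)
```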
thanks
from your code in here,
https://github.com/SamuelJoutard/Permutohedral_attention_module/blob/c86b8108fbfcf73ce300197e57cccbdfa25386ff/CRF/crf.py#L102
here is what i did, and then i got the error reported in the other post. it is your code, but used over dummy inputs, considering an rgb image and softmax scores with only one plane filled with 1. i expect the output, i.e. norm_1, to have the same number of elements as ones.
thanks for your help
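for reference, a minimal version of the check described above (not the exact code i ran, just an illustration with dummy inputs):

```python
import torch
from PAM_cuda.pl import PermutohedralLattice

pl = PermutohedralLattice.apply
n, h, w = 1, 32, 32
device = torch.device('cuda:0')

img = torch.rand(n, 3, h, w, device=device)
yy, xx = torch.meshgrid(torch.arange(h, device=device),
                        torch.arange(w, device=device))
spatial = torch.stack([yy, xx], dim=0).float().view(1, 2, -1).repeat(n, 1, 1)
features_1 = torch.cat([spatial / 100., img.view(n, 3, -1) / 15.], dim=1)

# softmax scores with a single plane filled with 1
ones = torch.ones(n, 1, h * w, device=device)
norm_1 = pl(features_1, ones)
print(norm_1.shape)   # expected: same number of elements as `ones`
```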