Open ramdhan1989 opened 1 year ago
Hi, I got an error when doing inference using augment=True. The error is shown as follows; please advise.
Thanks
Hi @ramdhan1989 ,
Sorry for replying late. You can modify the code at Line 121 of yolo.py from this:
yi = self.forward_once(xi)[0] # forward
to this:
yi = self.forward_once(xi)[0][0] # forward
The error is raised because in DRENet, we also return the degraded reconstruction image in def forward_once().
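For context, the surrounding augmented-inference loop in yolo.py looks roughly like the sketch below (YOLOv5-style code; the exact lines in this repo may differ slightly, so treat it as illustrative). The only change is the extra [0], which keeps the detection output and drops the degraded-image reconstruction:

```python
# Rough sketch of the augment=True branch in a YOLOv5-style Model.forward()
# (illustrative only; the repo's yolo.py may differ in details).
if augment:
    img_size = x.shape[-2:]   # height, width
    s = [1, 0.83, 0.67]       # scales
    f = [None, 3, None]       # flips (3 = flip left-right)
    y = []
    for si, fi in zip(s, f):
        xi = scale_img(x.flip(fi) if fi else x, si)
        yi = self.forward_once(xi)[0][0]  # detection output only; DRENet's forward_once also returns the reconstruction
        yi[..., :4] /= si                 # de-scale predicted boxes
        if fi == 3:
            yi[..., 0] = img_size[1] - yi[..., 0]  # de-flip left-right
        y.append(yi)
    return torch.cat(y, 1), None  # augmented inference, train
```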
============
It seems that you want to leverage multi-scale inference by setting augment=True. However, I'm afraid that the current C3ResAtnMHSA structure may not support different input sizes (because of the fixed-size positional encoding).
Thus, if you want to use multi-scale inference, you may either consider replacing C3ResAtnMHSA or changing its current structure. For the structure change, maybe you can modify the current fixed-size positional encoding into an adaptive one (perhaps by bilinear interpolation?).
You can have a try.
noted, thank you
Hi @WindVChen , I am interested in modifying the code to accommodate different image sizes. In my opinion, it would be beneficial for performance to apply augmented inference, and also to run inference on the original image size to capture larger objects, in addition to inference on sliced images. Would you mind guiding me on how I can start the modification? Do I need to change only the part below?
class C3ResAtnMHSA(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, size=14, shortcut=True, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3ResAtnMHSA, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[BottleneckResAtnMHSA(c_, size, shortcut=True) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
Thanks.
Regards,
Ramdhan
Actually you only need to change the following part:
class BottleneckResAtnMHSA(nn.Module):
    # Standard bottleneck
    def __init__(self, n_dims, size, shortcut=True):  # ch_in, ch_out, shortcut, groups, expansion
        super(BottleneckResAtnMHSA, self).__init__()
        height = size
        width = size
        self.cv1 = Conv(n_dims, n_dims//2, 1, 1)
        self.cv2 = Conv(n_dims//2, n_dims, 1, 1)
        '''MHSA PARAGRAMS'''
        self.query = nn.Conv2d(n_dims//2, n_dims//2, kernel_size=1)
        self.key = nn.Conv2d(n_dims//2, n_dims//2, kernel_size=1)
        self.value = nn.Conv2d(n_dims//2, n_dims//2, kernel_size=1)
        self.rel_h = nn.Parameter(torch.randn([1, n_dims//2, height, 1]), requires_grad=True)
        self.rel_w = nn.Parameter(torch.randn([1, n_dims//2, 1, width]), requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)
        self.add = shortcut

    def forward(self, x):
        x1 = self.cv1(x)
        n_batch, C, width, height = x1.size()
        q = self.query(x1).view(n_batch, C, -1)
        k = self.key(x1).view(n_batch, C, -1)
        v = self.value(x1).view(n_batch, C, -1)
        content_content = torch.bmm(q.permute(0, 2, 1), k)
        content_position = (self.rel_h + self.rel_w).view(1, C, -1).permute(0, 2, 1)
        content_position = torch.matmul(content_position, q)
        energy = content_content + content_position
        attention = self.softmax(energy)
        out = torch.bmm(v, attention.permute(0, 2, 1))
        out = out.view(n_batch, C, width, height)
        return x + self.cv2(out) if self.add else self.cv2(out)
More specifically, we can see from the previous issues that the errors (due to input resolutions) usually come from this part:
def forward(self, x):
    ...
    content_position = (self.rel_h + self.rel_w).view(1, C, -1).permute(0, 2, 1)
    content_position = torch.matmul(content_position, q)
    energy = content_content + content_position
    ...
And that is because self.rel_h and self.rel_w have a fixed size determined by the settings in DRE.yaml:
def __init__(self, n_dims, size, shortcut=True):
    ...
    self.rel_h = nn.Parameter(torch.randn([1, n_dims//2, height, 1]), requires_grad=True)
    self.rel_w = nn.Parameter(torch.randn([1, n_dims//2, 1, width]), requires_grad=True)
    ...
Since we have found the problem above, a straightforward solution is to make the following line resolution-adaptive:
content_position = (self.rel_h + self.rel_w).view(1, C, -1).permute(0, 2, 1)
My opinion is to add a line of code that interpolates self.rel_h and self.rel_w according to the input size inside def forward(). Then it will support inputs of different resolutions at inference time, and during training there will be no need to change DRENet.yaml every time the input resolution is changed.
Since I am not sure whether this (somewhat brute-force) solution will give good results, I would be very glad if you could share your experimental results with me on whether it is effective.
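In case it helps, a minimal, untested sketch of that idea for BottleneckResAtnMHSA.forward() is below. It assumes torch.nn.functional is available as F; everything else follows the class definition above, and the only change is that rel_h/rel_w are bilinearly resized to the incoming feature-map size before building the positional term:

```python
import torch.nn.functional as F  # assumed to be imported at the top of common.py

def forward(self, x):
    x1 = self.cv1(x)
    n_batch, C, H, W = x1.size()  # tensors are (N, C, H, W)

    # Resize the fixed-size positional encodings to the current feature-map size
    # (bilinear interpolation). When H and W already equal `size`, this is a no-op.
    rel_h = F.interpolate(self.rel_h, size=(H, 1), mode='bilinear', align_corners=False)
    rel_w = F.interpolate(self.rel_w, size=(1, W), mode='bilinear', align_corners=False)

    q = self.query(x1).view(n_batch, C, -1)
    k = self.key(x1).view(n_batch, C, -1)
    v = self.value(x1).view(n_batch, C, -1)

    content_content = torch.bmm(q.permute(0, 2, 1), k)                  # (B, HW, HW)
    content_position = (rel_h + rel_w).view(1, C, -1).permute(0, 2, 1)  # (1, HW, C)
    content_position = torch.matmul(content_position, q)                # (B, HW, HW)

    energy = content_content + content_position
    attention = self.softmax(energy)

    out = torch.bmm(v, attention.permute(0, 2, 1))
    out = out.view(n_batch, C, H, W)
    return x + self.cv2(out) if self.add else self.cv2(out)
```

Note that the interpolated encodings are no longer guaranteed to match what was learned at the training resolution, so the accuracy impact needs to be checked empirically.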
Hi, I have printed the tensor sizes at every step in the BottleneckResAtnMHSA and C3ResAtnMHSA classes inside common.py and got the summary below:

| input size | rel_h | rel_w | content_position | content_content |
| -- | -- | -- | -- | -- |
| 512×512 | (192×16×1) | (192×1×16) | (256×256) | (256×256) |
| 1024×1024 | (192×16×1) | (192×1×16) | (256×1024) | (1024×1024) |
| 1024×512 | (192×16×1) | (192×1×16) | (256×512) | (512×512) |
| 640×640 | (192×16×1) | (192×1×16) | (256×400) | (400×400) |
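For reference, the mismatch (and the interpolation fix) can be reproduced in isolation with the small check below, using the sizes from the table (C=192, positional encoding size 16) and assuming a 20×20 feature map at this layer for a 640×640 input:

```python
import torch
import torch.nn.functional as F

C, enc_size = 192, 16   # rel_h is (1 x 192 x 16 x 1), rel_w is (1 x 192 x 1 x 16)
H, W = 20, 20           # assumed feature-map size at this layer for a 640x640 input (HW = 400)

rel_h = torch.randn(1, C, enc_size, 1)
rel_w = torch.randn(1, C, 1, enc_size)

fixed = (rel_h + rel_w).view(1, C, -1)          # (1, 192, 256) -> cannot match content_content's 400 positions
rel_h_i = F.interpolate(rel_h, size=(H, 1), mode='bilinear', align_corners=False)
rel_w_i = F.interpolate(rel_w, size=(1, W), mode='bilinear', align_corners=False)
adaptive = (rel_h_i + rel_w_i).view(1, C, -1)   # (1, 192, 400) -> matches the 400 positions of a 640x640 input

print(fixed.shape, adaptive.shape)
```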