The Preprocessing output has shape (B*784, 2, 1024). It is computed as torch.stack([(B*784, 1024), (B*784, 1024)], dim=1), where the two (B*784, 1024) tensors are the MeanMapper outputs for layer2 and layer3.
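A quick sketch of that shape flow with random placeholder tensors (B here is a made-up batch size, just to check the shapes):

```python
import torch

B = 2                                      # hypothetical batch size; 784 = 28*28 patches per image
layer2_feats = torch.randn(B * 784, 1024)  # stands in for the MeanMapper output of layer2
layer3_feats = torch.randn(B * 784, 1024)  # stands in for the MeanMapper output of layer3

stacked = torch.stack([layer2_feats, layer3_feats], dim=1)
print(stacked.shape)                       # torch.Size([1568, 2, 1024]) == (B*784, 2, 1024)
```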
In the Aggregator, after a reshape to (B*784, 1, 2048), it applies adaptive_avg_pool1d with target_dim=1024 to get an output of (B*784, 1, 1024), as traced below. Is the intent to average layer2 and layer3 position-wise, at the same feature dimension? If so, I don't think this pooling can achieve that.
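Tracing the Aggregator shapes the same way (placeholder tensors again), the pooling runs over the flattened 2048-dim row:

```python
import torch
import torch.nn.functional as F

stacked = torch.randn(2 * 784, 2, 1024)           # Preprocessing output, (B*784, 2, 1024)
flattened = stacked.reshape(len(stacked), 1, -1)  # (B*784, 1, 2048)
pooled = F.adaptive_avg_pool1d(flattened, 1024)   # (B*784, 1, 1024)
print(pooled.shape)                               # torch.Size([1568, 1, 1024])
```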
For example: stacking [1,2,3,4,5] and [1,1,1,1,1] gives [[1,2,3,4,5],[1,1,1,1,1]], and reshaping gives [[1,2,3,4,5,1,1,1,1,1]]. adaptive_avg_pool1d then averages adjacent elements along that flattened row, computing (1+2)/2 = 1.5, (3+4)/2 = 3.5, and so on, rather than averaging the same position across the two layers: (1+1)/2, (2+1)/2, (3+1)/2, ... (maybe not a good example).
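Here is that toy example run through PyTorch directly (my own snippet, just to show which pairs get averaged):

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1., 2., 3., 4., 5.])
b = torch.tensor([1., 1., 1., 1., 1.])

flat = torch.stack([a, b], dim=0).reshape(1, 1, 10)   # [[1,2,3,4,5,1,1,1,1,1]]
print(F.adaptive_avg_pool1d(flat, 5))
# tensor([[[1.5000, 3.5000, 3.0000, 1.0000, 1.0000]]])  -> (1+2)/2, (3+4)/2, (5+1)/2, ...

# position-wise average across the two layers, which is what I expected:
print(torch.stack([a, b], dim=0).mean(dim=0))
# tensor([1.0000, 1.5000, 2.0000, 2.5000, 3.0000])      -> (1+1)/2, (2+1)/2, (3+1)/2, ...
```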
Maybe I am wrong; please tell me. Thank you for your help!