Hi, thanks for your hard work. I read the paper, and if I understand correctly, the first transformer block doesn't have any positional information. Would this cause any issues for passing information on to the rest of the blocks, since the self-attention modules in the other blocks always come with some positional information? Have you tried any other relative positional encoding methods to fill in the gap for the first block?
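To make the question concrete, here's a rough sketch of what I mean by "filling in the gap": adding an ALiBi-style relative bias to the first block's attention logits only, and leaving the other blocks as they are in the paper. This is just an illustration, not something from the paper; the `alibi_bias` helper, the shapes, and the slope schedule are my own assumptions.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Hypothetical ALiBi-style bias of shape (num_heads, seq_len, seq_len).

    Each head gets a fixed slope that linearly penalizes attention to
    more distant (earlier) positions, injecting relative position info
    without any learned positional embeddings.
    """
    # Geometric slope per head, as in the ALiBi paper (assumes num_heads is a power of 2).
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    # Signed distance (j - i), clamped to <= 0 for a causal penalty on past tokens.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()  # (seq_len, seq_len)
    return slopes[:, None, None] * rel[None, :, :]

# Usage idea: in the first block only, add this bias to the attention logits
# before the softmax, e.g.
#   attn_logits = attn_logits + alibi_bias(seq_len, num_heads).to(attn_logits.device)
```

Would something like this (or another relative scheme such as RoPE applied only in block 1) be compatible with your setup, or would it interfere with the mechanism the later blocks rely on?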