Hello, I'm hoping you can help me understand why the dimension of A is TDN, TDH. The dimension of hidden_states is TD1, and it becomes TD after dense.
ATA += W * (hidden_states @ hidden_states.t())  # 1956,1956
Why is ATA calculated this way?
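To make the question concrete, here is a minimal NumPy sketch of how I currently read that line. Everything in it is my own assumption rather than the actual code: hidden_states is taken to be a single column vector per sample, W is taken to be a scalar weight per sample, the size is shrunk from 1956 to 4, NumPy's `.T` stands in for `.t()`, and `accumulate_gram` is a hypothetical helper name. Under those assumptions, summing the weighted outer products gives the same matrix as A^T diag(W) A, where each row of A is one sample's hidden_states.

```python
import numpy as np


def accumulate_gram(samples, weights):
    """Sum w * (h @ h.T) over all samples; each h is a (D, 1) column vector."""
    dim = samples[0].shape[0]
    ata = np.zeros((dim, dim))
    for h, w in zip(samples, weights):
        # (D, 1) @ (1, D) -> (D, D), mirroring hidden_states @ hidden_states.t()
        ata += w * (h @ h.T)
    return ata


rng = np.random.default_rng(0)
D, N = 4, 3  # small stand-ins (assumption); the original comment shows 1956 x 1956
samples = [rng.standard_normal((D, 1)) for _ in range(N)]
weights = [0.5, 1.0, 2.0]  # assumed scalar weight per sample

ATA = accumulate_gram(samples, weights)

# Stack the samples as rows of A, shape (N, D), and compare with A^T diag(w) A.
A = np.hstack(samples).T
closed_form = A.T @ np.diag(weights) @ A

print(ATA.shape)
print(np.allclose(ATA, closed_form))
```

If this reading is right, the 1956,1956 shape would simply be D by D, with D the length of hidden_states, but I am not sure how that fits with the dimensions of A that I mentioned above.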
Could you recommend any material that would help me understand this? Looking forward to your reply.