Since there is no PyTorch code example provided, I wrote my own version below. But it turned out to give different results from the Hugging Face model. Is anything wrong with my code?
Self-answer: it turns out `q_scaling` should be `sqrt(3.0)` instead of `1.0`, but I don't fully understand the underlying reason. Is it because DeBERTa's attention score is the sum of three components (c2c, c2p, p2c), while BERT only has c2c? The DeBERTa paper does scale the attention scores by `1/sqrt(3d)` rather than `1/sqrt(d)` for exactly this reason, so the extra factor of `sqrt(3)` relative to BERT would come from those three terms.
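A minimal sketch of where the `sqrt(3)` comes from, assuming the Hugging Face `DisentangledSelfAttention` convention of dividing scores by `sqrt(head_dim * scale_factor)`, where `scale_factor` counts the enabled attention terms (the function name and `pos_att_type` list are modeled on the HF config, not taken from the post):

```python
import math

def deberta_attention_scale(head_dim, pos_att_type):
    # Assumption: mirrors HF DeBERTa, where the softmax denominator is
    # sqrt(head_dim * scale_factor) and scale_factor counts attention terms.
    scale_factor = 1  # content-to-content (c2c) is always present
    if "c2p" in pos_att_type:
        scale_factor += 1  # content-to-position term
    if "p2c" in pos_att_type:
        scale_factor += 1  # position-to-content term
    return math.sqrt(head_dim * scale_factor)

# BERT-style attention: only c2c, denominator sqrt(d)
bert_scale = deberta_attention_scale(64, [])
# DeBERTa: c2c + c2p + p2c, denominator sqrt(3 * d)
deberta_scale = deberta_attention_scale(64, ["c2p", "p2c"])

# The ratio is sqrt(3): this is the extra q_scaling needed on top of
# the usual 1/sqrt(d) to match the Hugging Face DeBERTa outputs.
print(deberta_scale / bert_scale)
```

So a kernel that hardcodes the BERT-style `1/sqrt(d)` scaling would need `q_scaling = sqrt(3.0)` to reproduce DeBERTa's `1/sqrt(3d)`.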