Hello, wonderful work! I wonder what happens if we set use_shared_rel_pos_bias=True, so that the relative position bias table is shared across layers. I have two ablations in mind: A. inject the bias at every attention layer. B. inject the bias only at the input. How would the performance of A, B, and layer-specific relative position bias compare?
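To make the three variants concrete, here is a minimal NumPy sketch of what I mean. It is not the repository's implementation: the function names are hypothetical, it uses 1D relative offsets for simplicity, and it assumes that "only at input" means adding the bias to the attention logits of the first layer only.

```python
import numpy as np


def make_bias_table(num_heads, window, rng):
    # One entry per head and per relative offset in [-(window-1), window-1].
    return rng.normal(scale=0.02, size=(num_heads, 2 * window - 1))


def table_to_bias(table, window):
    # Expand the table into a (heads, window, window) bias indexed by i - j.
    offsets = np.arange(window)[:, None] - np.arange(window)[None, :]
    return table[:, offsets + window - 1]


def bias_plan(variant, num_layers, num_heads, window, seed=0):
    """Return one bias per layer (or None where no bias is added)."""
    rng = np.random.default_rng(seed)
    if variant == "layer_specific":
        # Each layer owns its own table.
        return [
            table_to_bias(make_bias_table(num_heads, window, rng), window)
            for _ in range(num_layers)
        ]
    # Variants A and B both use a single table shared across layers.
    shared = table_to_bias(make_bias_table(num_heads, window, rng), window)
    if variant == "A":
        # Shared bias added to the logits of every attention layer.
        return [shared] * num_layers
    if variant == "B":
        # Assumption: bias added only in the first attention layer.
        return [shared] + [None] * (num_layers - 1)
    raise ValueError(f"unknown variant: {variant}")
```

In this sketch, A and B have the same number of bias parameters (one table), while the layer-specific variant has num_layers times as many.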