thefacetakt opened this issue 3 months ago
@symphonylyh Could you please take a look? Thanks
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
@thefacetakt Correct. This is done to stay consistent with the fused QKV GEMM used in all models, but it does indeed introduce some redundant computation.
However, I think it won't be very significant in the entire computation:
Do you agree?
We can still investigate (1) to see whether it's critical enough. Otherwise, staying consistent with the other models at a <1% slowdown may be acceptable.
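For reference, the pattern under discussion looks roughly like this. This is a simplified PyTorch sketch with a made-up class name, not the actual TensorRT-LLM code; it only illustrates which slices of each fused GEMM go unused:

```python
import torch
import torch.nn as nn


class FusedCrossAttentionQKV(nn.Module):
    """Sketch of the current pattern: one fused QKV projection applied to both inputs."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, hidden_states: torch.Tensor, encoder_output: torch.Tensor):
        # Only the Q slice of this result is consumed by cross attention.
        qkv = self.qkv(hidden_states)
        q, _k_unused, _v_unused = qkv.chunk(3, dim=-1)
        # Only the K/V slices of this result are consumed.
        cross_qkv = self.qkv(encoder_output)
        _q_unused, cross_k, cross_v = cross_qkv.chunk(3, dim=-1)
        return q, cross_k, cross_v
```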
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
As far as I understand, when using cross attention we first compute `qkv = self.qkv(hidden_states)` and then `cross_qkv = self.qkv(encoder_output)`. But later only `q` from `qkv` is used, and only `kv` from `cross_qkv` is used. That seems like wasteful computation. Perhaps there should be two separate projections, `q` and `kv`. (This would also improve quantization a bit, since the two projections would get separate scales.)
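A minimal sketch of the proposed split, again in simplified PyTorch rather than the actual TensorRT-LLM modules (`q_proj` and `kv_proj` are hypothetical names):

```python
import torch
import torch.nn as nn


class SplitCrossAttentionProjections(nn.Module):
    """Sketch of the proposal: separate Q and KV projections, no discarded outputs."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Decoder hidden states only ever need the Q projection here.
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        # Encoder output only ever needs the K/V projections.
        self.kv_proj = nn.Linear(hidden_size, 2 * hidden_size)

    def forward(self, hidden_states: torch.Tensor, encoder_output: torch.Tensor):
        q = self.q_proj(hidden_states)           # no wasted K/V columns
        cross_kv = self.kv_proj(encoder_output)  # no wasted Q columns
        cross_k, cross_v = cross_kv.chunk(2, dim=-1)
        return q, cross_k, cross_v
```

Compared with running the fused `self.qkv` on both inputs and discarding the unused slices, this avoids the redundant GEMM columns and lets the Q and KV projections carry independent quantization scales, as suggested above.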