astramind-ai / Mixture-of-depths

Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

Question about the impl #11


GeneZC commented 3 months ago

I have two questions concerning the implementation:

  1. For flash attention, why is there no slicing op to index the i-th batch element?

https://github.com/astramind-ai/Mixture-of-depths/blob/aff9e74fc9c5a30d2c59dc36767f1f0fd86255e8/MoD/MoD.py#L73

  2. For position ids, why is there a slicing op to index the i-th element, given that the size of the first dimension of position_ids should always be 1? (A sketch of the shapes involved follows the links below.)

https://github.com/astramind-ai/Mixture-of-depths/blob/aff9e74fc9c5a30d2c59dc36767f1f0fd86255e8/MoD/MoD.py#L70
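
To make the shapes behind both questions concrete, here is a minimal, self-contained sketch of a MoD-style per-example routing loop. This is not the repo's code: the `router` and `block` callables, the block's keyword arguments, and the capacity handling are placeholders I've assumed for illustration.

```python
import torch
import torch.nn as nn

def mod_layer_sketch(hidden_states, position_ids, router, block, capacity):
    """Illustrative MoD-style routing loop (not the repo's implementation).

    hidden_states: [batch, seq_len, dim]
    position_ids:  [batch, seq_len]  -- the shape assumption behind question 2
    router:        e.g. nn.Linear(dim, 1), scores each token
    block:         a transformer layer taking (hidden, position_ids, attention_mask)
    capacity:      number of tokens per sequence routed through the block
    """
    batch, seq_len, dim = hidden_states.shape
    weights = router(hidden_states).squeeze(-1)                # [batch, seq_len]
    topk_idx = torch.topk(weights, k=capacity, dim=1).indices  # [batch, capacity]

    output = hidden_states.clone()
    for i in range(batch):
        idx = topk_idx[i].sort().values                    # kept positions for example i
        selected = hidden_states[i, idx].unsqueeze(0)      # [1, capacity, dim]
        selected_pos = position_ids[i, idx].unsqueeze(0)   # [1, capacity] <- the slice in question 2
        # Question 1: with flash attention a per-example attention-mask slice would
        # normally be taken here too; an eager-style mask would need shape [1, 1, capacity, capacity].
        processed = block(selected, position_ids=selected_pos, attention_mask=None)
        # Residual update weighted by the router score, as in the MoD paper.
        output[i, idx] = hidden_states[i, idx] + weights[i, idx].unsqueeze(-1) * processed.squeeze(0)
    return output

# Example usage with dummy modules (shapes only; values are meaningless):
router = nn.Linear(64, 1)
block = lambda h, position_ids=None, attention_mask=None: h  # identity stand-in for a layer
out = mod_layer_sketch(torch.randn(2, 16, 64), torch.arange(16).expand(2, 16),
                       router, block, capacity=8)
print(out.shape)  # torch.Size([2, 16, 64])
```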

xinlong-yang commented 3 months ago

Hi, I have the same question about point 1, and I'm not sure whether an operation like the one in the LlamaAttention class would solve it. For question 2, I think position_ids has shape [bs, seqlen], so it does need an index.
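
For what it's worth, the difference between the two attention paths comes down to mask shapes. Below is a rough sketch with made-up sizes; the convention (4D mask for the eager/SDPA path vs. a 2D padding mask, or no mask at all, for the flash-attention path) is an assumption about the transformers version being used, not something confirmed in this repo.

```python
import torch

batch, seq_len, capacity = 2, 16, 8
idx = torch.randperm(seq_len)[:capacity].sort().values   # positions kept for one example

# Eager/SDPA-style 4D mask [batch, 1, q_len, kv_len]: selecting tokens for one
# example means slicing both token dimensions.
mask_4d = torch.zeros(batch, 1, seq_len, seq_len)
mask_4d_i = mask_4d[0:1, :, idx][:, :, :, idx]            # [1, 1, capacity, capacity]

# Flash-attention-style 2D padding mask [batch, seq_len]: only the token dimension
# is sliced, and the mask can be dropped entirely if none of the kept tokens are padding.
mask_2d = torch.ones(batch, seq_len, dtype=torch.bool)
mask_2d_i = mask_2d[0:1, idx]                             # [1, capacity]

print(mask_4d_i.shape, mask_2d_i.shape)
```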