deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Drop Token #48

Closed Richie-yan closed 5 months ago

Richie-yan commented 5 months ago

Hello @DeepSeekDDM @luofuli, I have some questions about token dropping in DeepSeek-V2. Is the capacity in the token-dropping strategy defined per expert or per device? If it is per expert, is the capacity calculated as `capacity = math.ceil(num_tokens * topk) / num_experts * capacity_factor`, with each expert then processing its own tokens, dropping the lowest-scored ones when the routed tokens exceed the capacity and padding when they fall below it? If it is per device, is the capacity calculated as `capacity = math.ceil(num_tokens * topk) / num_groups * capacity_factor`, and how is the token dropping executed in that case? The paper mentions device-level token dropping, which is the source of my confusion.
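To make the two interpretations concrete, here is a minimal sketch of how the capacity formulas in the question could be read, and how lowest-score dropping per expert (or per device group) might look. The function names (`expert_level_capacity`, `device_level_capacity`, `drop_lowest_scored`) and the mechanics are assumptions for illustration only, not DeepSeek-V2's actual implementation:

```python
# Hypothetical sketch of the two capacity interpretations discussed above.
# Variable names follow the question; this is NOT the official DeepSeek-V2 code.
import math
import torch


def expert_level_capacity(num_tokens: int, topk: int, num_experts: int,
                          capacity_factor: float) -> int:
    # Average load per expert (num_tokens * topk assignments spread over
    # num_experts), scaled by the capacity factor.
    return int(math.ceil(num_tokens * topk) / num_experts * capacity_factor)


def device_level_capacity(num_tokens: int, topk: int, num_groups: int,
                          capacity_factor: float) -> int:
    # Same budget, but computed per device group hosting several experts.
    return int(math.ceil(num_tokens * topk) / num_groups * capacity_factor)


def drop_lowest_scored(scores: torch.Tensor, assigned: torch.Tensor,
                       capacity: int) -> torch.Tensor:
    """Keep at most `capacity` assignments with the highest router scores.

    scores:   (num_assigned,) router affinity of each token routed to this
              expert (or device group), flattened.
    assigned: (num_assigned,) indices of the routed tokens.
    Returns the kept token indices; the rest are dropped (the dropped tokens
    would skip the expert FFN and pass through the residual connection).
    """
    if assigned.numel() <= capacity:
        return assigned  # under capacity: nothing is dropped
    keep = torch.topk(scores, k=capacity).indices
    return assigned[keep]
```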

Richie-yan commented 5 months ago

Adding another question: How should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this? @DeepSeekDDM @luofuli
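One way such a guarantee could be enforced, purely as a guess on my part, is to mark tokens from a random ~10% subset of training sequences as exempt and exclude them when selecting the lowest-scored overflow tokens to drop. The `build_no_drop_mask` and `drop_with_exemptions` helpers below are hypothetical and not the official mechanism, which the issue linked in the reply may clarify:

```python
# Assumed sketch of "never drop tokens from ~10% of training sequences".
# The mask-based approach and all names here are illustrative assumptions.
import torch


def build_no_drop_mask(batch_size: int, seq_len: int,
                       exempt_fraction: float = 0.10,
                       device: str = "cpu") -> torch.Tensor:
    """Mark every token of a random subset of sequences as never-droppable."""
    exempt_seqs = torch.rand(batch_size, device=device) < exempt_fraction
    return exempt_seqs.unsqueeze(1).expand(batch_size, seq_len)  # (B, S) bool


def drop_with_exemptions(scores: torch.Tensor, no_drop: torch.Tensor,
                         capacity: int) -> torch.Tensor:
    """Drop lowest-scored overflow tokens, but only among non-exempt ones.

    scores:  (num_assigned,) router scores of tokens routed to one expert/device.
    no_drop: (num_assigned,) bool, True for tokens from exempt sequences.
    Returns a bool keep-mask of the same shape as `scores`.
    """
    keep = torch.ones_like(scores, dtype=torch.bool)
    # Never drop more than the number of non-exempt tokens available.
    overflow = min(scores.numel() - capacity, int((~no_drop).sum()))
    if overflow > 0:
        # Push exempt tokens to +inf so they are never picked for dropping.
        droppable = scores.masked_fill(no_drop, float("inf"))
        drop_idx = torch.topk(droppable, k=overflow, largest=False).indices
        keep[drop_idx] = False
    return keep
```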

DeepSeekDDM commented 5 months ago

Refer to this issue: https://github.com/deepseek-ai/DeepSeek-V2/issues/5