PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

[Question] The evaluation results vary every time. #60

Open koda-11 opened 6 months ago

koda-11 commented 6 months ago

Question

When I tried to evaluate the LanguageBind/MoE-LLaVA-Phi2-2.7B-4e model, the evaluation results varied from run to run (e.g. for GQA: 61.42 on the 1st run, 61.32 on the 2nd, 61.22 on the 3rd).
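For reference, here is a minimal sketch of the determinism settings that could be pinned before re-running the evaluation, to rule out ordinary seeding or cuDNN/cuBLAS nondeterminism (the `make_deterministic` helper and the seed value are placeholders, not part of the repo's evaluation scripts):

```python
# Hypothetical helper, not part of MoE-LLaVA: pin every RNG and ask PyTorch
# to prefer deterministic kernels before re-running the GQA evaluation.
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # warn_only=True logs a warning instead of raising when an op has no
    # deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
```

If the scores (and the expert counts below) still differ with these settings in place, the remaining variation would more likely come from floating-point nondeterminism in the gating logits than from RNG state.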

I tried to debug this and found that the routing decisions in the MoE layers differ between inference runs. In the GQA evaluation, the sample below is answered as either "red" or "brown":

`{"question_id": "2059565", "image": "n130638.jpg", "text": "What color is the dirt?\nAnswer the question using a single word or phrase.", "category": "default"}`

I printed `self.deepspeed_moe.exp_counts` at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/moe/layer.py#L132.
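That print is roughly equivalent to the forward-hook sketch below (attribute names follow `deepspeed.moe.layer.MoE`, and `model` stands for the loaded MoE-LLaVA model, so treat it as an illustration rather than the exact patch):

```python
# Illustrative sketch: log per-layer expert counts without editing the
# DeepSpeed source, by hooking every deepspeed.moe.layer.MoE module.
from deepspeed.moe.layer import MoE


def log_expert_counts(model):
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, MoE):
            def hook(mod, args, output, name=name):
                # exp_counts: how many tokens were routed to each expert
                # in this forward pass.
                print(name, mod.deepspeed_moe.exp_counts)
            handles.append(module.register_forward_hook(hook))
    return handles  # call .remove() on each handle when finished
```

The two cases below show what this prints for the sample above in two separate runs: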

Case 1:

tensor([ 63, 140, 157, 270]) tensor([ 9, 185, 0, 436]) tensor([211, 44, 49, 326]) tensor([134, 308, 11, 177]) tensor([351, 203, 56, 20]) tensor([ 9, 375, 33, 213]) tensor([ 58, 136, 264, 172]) tensor([ 30, 12, 582, 6]) tensor([150, 450, 10, 20]) tensor([ 93, 147, 1, 389]) tensor([ 48, 221, 30, 331]) tensor([541, 30, 46, 13]) tensor([ 24, 169, 410, 27]) tensor([ 10, 293, 92, 235]) tensor([ 0, 36, 1, 593]) tensor([116, 0, 514, 0]) tensor([1, 0, 0, 0]) tensor([0, 0, 0, 1]) tensor([0, 0, 0, 1]) tensor([1, 0, 0, 0]) tensor([1, 0, 0, 0]) tensor([0, 1, 0, 0]) tensor([0, 0, 0, 1]) tensor([0, 1, 0, 0]) tensor([1, 0, 0, 0]) tensor([0, 1, 0, 0]) tensor([0, 0, 0, 1]) tensor([1, 0, 0, 0]) tensor([0, 0, 1, 0]) tensor([0, 0, 1, 0]) tensor([0, 0, 0, 1]) tensor([0, 0, 1, 0]) Red

Case 2:

tensor([ 63, 140, 157, 270]) tensor([ 8, 188, 0, 434]) tensor([200, 46, 52, 332]) tensor([136, 314, 14, 166]) tensor([351, 201, 59, 19]) tensor([ 7, 371, 36, 216]) tensor([ 55, 134, 261, 180]) tensor([ 34, 14, 579, 3]) tensor([156, 443, 10, 21]) tensor([ 95, 147, 1, 387]) tensor([ 48, 222, 29, 331]) tensor([548, 31, 40, 11]) tensor([ 26, 168, 411, 25]) tensor([ 8, 296, 97, 229]) tensor([ 0, 39, 3, 588]) tensor([113, 0, 517, 0]) tensor([0, 1, 0, 0]) tensor([0, 0, 0, 1]) tensor([0, 0, 0, 1]) tensor([0, 0, 0, 1]) tensor([1, 0, 0, 0]) tensor([0, 1, 0, 0]) tensor([0, 0, 1, 0]) tensor([0, 0, 0, 1]) tensor([1, 0, 0, 0]) tensor([0, 1, 0, 0]) tensor([0, 0, 0, 1]) tensor([1, 0, 0, 0]) tensor([0, 0, 1, 0]) tensor([0, 0, 1, 0]) tensor([0, 0, 0, 1]) tensor([0, 0, 1, 0]) Brown

Could you please check whether this behavior is expected, or whether there is a bug in the code?

Thanks