lijm071 closed this issue 2 weeks ago
For the category frequency vector p = (p1, p2, ..., pn), the term p^λ is used inside the softmax to adjust the category distribution that the model's weights learn. When λ = 0, the model weights learned from long-tailed data also follow a long-tailed distribution. When λ = 1, the model reaches a balanced state, i.e., Balanced Softmax. When λ < 0, p^λ changes so that the model's weights focus even more on the head categories. You can plug in different values of λ to see how they change p^λ and to understand the relationship between the two.
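To make the "plug in different values of λ" suggestion concrete, here is a minimal sketch in plain Python. The frequency vector `p` and the helper names `prior_weights` and `adjusted_softmax` are made up for illustration and are not taken from the repository. The sketch also assumes the adjustment has the form softmax(z + λ·log p), applied in the training loss while the raw logits z are used at test time. Under that assumption, a factor p^λ that favors some classes during training is something the learned logits have to compensate for, which would be one way to read the effect described above.

```python
import math

def prior_weights(p, lam):
    """Return p**lam, renormalized so the entries sum to 1."""
    powered = [pi ** lam for pi in p]
    total = sum(powered)
    return [w / total for w in powered]

def adjusted_softmax(logits, p, lam):
    """Softmax over logits + lam * log(p), i.e. exp(logit) weighted by p**lam."""
    shifted = [z + lam * math.log(pi) for z, pi in zip(logits, p)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical long-tailed class frequencies: head, medium, tail.
p = [0.7, 0.2, 0.1]

for lam in (-1.0, 0.0, 1.0, 2.0):
    weights = prior_weights(p, lam)
    # With equal logits, the adjusted softmax output equals the prior weights.
    probs = adjusted_softmax([0.0, 0.0, 0.0], p, lam)
    print(f"lambda={lam:+.1f}",
          "p^lambda:", [round(w, 3) for w in weights],
          "adjusted softmax:", [round(q, 3) for q in probs])
```

With equal logits the adjusted softmax simply reproduces the normalized p^λ, so the printed rows show directly which classes the training-time term favors for each λ.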
I would like to ask: when λ < 0, why does the expert model focus on the head categories? Shouldn't "λ < 0" decrease the predicted probability of the head categories?