Hey! I'm fascinated by work on finding competitors to Adam, particularly since, from an interpretability perspective, Adam may have some strange properties, such as probably causing large outlier dimensions in the residual stream.
Do you have access to any of the language models you trained (such as the GPT-2 Small-sized models) to investigate this?