gpucce closed this 1 year ago
Hi, this implementation should behave exactly like the original model.
The only differences are related to the numerical instabilities of `Linear` with respect to `RowParallelLinear` and `ColumnParallelLinear`.
Mathematically the outputs should be equal, but slight floating-point differences can accumulate and result in a non-negligible output difference.
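As a toy illustration of why two mathematically equal computations can disagree, here is a minimal pure-Python sketch. It is not code from this repository: `dot` and `sharded_dot` are hypothetical helpers, and the sketch assumes that a row-parallel layer splits the input dimension into shards and sums the partial results, which changes the order of the floating-point additions.

```python
import math


def dot(x, w):
    """Accumulate the dot product left to right, as one unsharded layer might."""
    acc = 0.0
    for xi, wi in zip(x, w):
        acc += xi * wi
    return acc


def sharded_dot(x, w, split):
    """Hypothetical stand-in for a row-parallel layer: compute two partial
    dot products over disjoint slices of the input, then add them."""
    left = dot(x[:split], w[:split])
    right = dot(x[split:], w[split:])
    return left + right


x = [0.1, 0.2, 0.3]
w = [1.0, 1.0, 1.0]

full = dot(x, w)                      # (0.1 + 0.2) + 0.3
sharded = sharded_dot(x, w, split=1)  # 0.1 + (0.2 + 0.3)

print(full == sharded)                             # False
print(math.isclose(full, sharded, rel_tol=1e-12))  # True
```

The two results differ only in the last bit here, but in a deep network such rounding differences feed into every subsequent layer, so comparing outputs with a tolerance rather than exact equality is usually the more meaningful check.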
@galatolofederico thank you very much for the reply! I noticed this behaviour and also thought this could be the reason. I was curious if you had seen the same.
@galatolofederico thanks for making this. Are you able to replicate the original model output exactly with this approach, perhaps by checking on one of the smaller models?