I'm working on converting a RoBERTa model to gluonnlp 0.10.0 with mxnet 1.7.0.
I managed to get it working with gluonnlp 1.0.0 and mxnet 2.0.0, and the hidden-layer activations are the same as the source model's, but with gluonnlp 0.10.0 and mxnet 1.7.0 they differ very slightly.
The discrepancy starts in the first layer, so I'm assuming it has something to do with the embeddings.
I could have made a mistake somewhere, but I'm wondering if there's a simpler explanation.
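For context, this is roughly how I'm checking where the numbers diverge. It's a minimal sketch: the file names and the 'word_embed' key lookup are placeholders for my own scripts, and the actual parameter name depends on the model prefix in gluonnlp 0.10.0.

```python
import numpy as np
import mxnet as mx

# Compare the raw word-embedding weights in the converted .params file
# against the weights exported from the source model. File names are
# placeholders for how I saved things in my own scripts.
converted = mx.nd.load('converted_roberta_0_10.params')
source_word_embed = np.load('source_word_embedding.npy')

# The exact parameter key depends on the model prefix, so look it up
# instead of hard-coding it.
embed_key = next(k for k in converted if 'word_embed' in k and 'weight' in k)
converted_word_embed = converted[embed_key].asnumpy()
print('embedding weight max abs diff:',
      np.abs(converted_word_embed - source_word_embed).max())

# Same check on first-layer activations dumped for one identical batch.
src_act = np.load('source_layer0_activations.npy')
cvt_act = np.load('gluonnlp_0_10_layer0_activations.npy')
print('layer-0 activation max abs diff:',
      np.abs(src_act - cvt_act).max())
```

Checking the raw weights separately from the activations should at least tell me whether the problem is in the converted parameters themselves or in how the 0.10.0 model combines them.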