JohnGiorgi / DeCLUTR

The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

Wrong training procedure? #237

Closed: repodiac closed this issue 3 years ago

repodiac commented 3 years ago

I trained an extension of model sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (see #235).

After training I used the script save_pretrained_hf.py in order to convert it to a HuggingFace Transformers-compatible format.

When I now run the example code for mean-pooled embeddings, I get the following warning (output_bs32_ep20_export is my exported model):

Some weights of the model checkpoint at /tf/data/output_bs32_ep20_export were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /tf/data/output_bs32_ep20_export and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Any idea why this occurs? Is the warning something to worry about, or can I safely ignore it?

JohnGiorgi commented 3 years ago

Hmm, your pretrained model does not have weights for ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']

Did you set masked_language_modeling to true in the config? If so, the model would have been loaded with AutoModelForMaskedLM (see here), and I would have expected those weights to have been trained.

Still, maybe I am wrong and lm_head is not used by your particular model. I think it is still worth evaluating the model you have trained and seeing if it performs well on your downstream tasks.
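
If you want to double-check what was actually exported, something along these lines (a rough sketch; it assumes the export directory contains a pytorch_model.bin) will list any lm_head weights that made it into the checkpoint:

import torch

# Hypothetical check: load the exported state dict and list the lm_head weights, if any.
state_dict = torch.load(
    "/tf/data/output_bs32_ep20_export/pytorch_model.bin",  # your exported directory
    map_location="cpu",
)
lm_head_keys = [k for k in state_dict if "lm_head" in k]
print(lm_head_keys or "no lm_head weights in this checkpoint")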

repodiac commented 3 years ago

"model": {
        "type": "declutr",
        "text_field_embedder": {
            "type": "mlm",
            "token_embedders": {
                "tokens": {
                    "type": "pretrained_transformer_mlm",
                    "model_name": transformer_model,
                    "masked_language_modeling": true
                },
            },
        },
        "loss": {
            "type": "nt_xent",
            "temperature": 0.05,
        },
        // There was a small bug in the original implementation that caused gradients derived from
        // the contrastive loss to be scaled by 1/N, where N is the number of GPUs used during
        // training. This has been fixed. To reproduce results from the paper, set this to false.
        // Note that this will have no effect if you are not using distributed training with more
        // than 1 GPU.
        "scale_fix": false
    },

However, as I wrote in https://github.com/JohnGiorgi/DeCLUTR/issues/118#issuecomment-927912463, in the continued/restarted runs I used the first model via from_archive. Is that the problem?

    "model": {
        "type": "from_archive",
        "archive_file": "/notebooks/DeCLUTR/output_bs32_ep10/model.tar.gz"
    },

Any clarification is highly appreciated!

JohnGiorgi commented 3 years ago

I think you are free to ignore these messages. I imagine this happens because, somewhere during loading of the model, AutoModel.from_pretrained is used, so the lm_head weights are not loaded. That is fine, because we don't use them to produce sentence embeddings.
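
If you want to convince yourself, a quick sketch like the one below (plain transformers, not DeCLUTR code, using your exported path as an example) shows that the bare encoder and the MLM variant expect different parameter sets, which is exactly what the warning is reporting:

from transformers import AutoModel, AutoModelForMaskedLM

path = "/tf/data/output_bs32_ep20_export"  # your exported directory

# AutoModel builds the bare encoder (XLMRobertaModel): it has no lm_head
# parameters, so lm_head weights in the checkpoint are reported as "not used".
base = AutoModel.from_pretrained(path)

# AutoModelForMaskedLM builds the encoder plus the MLM head, so it would
# load (or, if missing, randomly initialize) the lm_head weights.
mlm = AutoModelForMaskedLM.from_pretrained(path)

print(any("lm_head" in name for name, _ in base.named_parameters()))  # False
print(any("lm_head" in name for name, _ in mlm.named_parameters()))   # True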

repodiac commented 3 years ago

I have to admit that I am not familiar enough with the underlying XLMRobertaModel, but lm_head sounds to me like the last hidden layer (in general, you put a task-specific head on top, e.g. a softmax for classification tasks). So for embeddings I would expect lm_head to be used as the last layer?

JohnGiorgi commented 3 years ago

The example code you cited uses mean pooling on the token embeddings from the model's last transformer block. This doesn't require lm_head.
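
Roughly, the pooling looks like the sketch below (along the lines of the usual HuggingFace mean-pooling example, not necessarily the exact snippet you ran); note that lm_head never appears in this computation:

import torch
from transformers import AutoModel, AutoTokenizer

path = "/tf/data/output_bs32_ep20_export"  # your exported directory
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)

sentences = ["A sentence to embed.", "Another sentence."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Token embeddings from the model's last transformer block.
    token_embeddings = model(**inputs).last_hidden_state

# Mean-pool over tokens, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (batch_size, hidden_size)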

JohnGiorgi commented 3 years ago

Closing this, feel free to re-open if you are still having issues.