Closed: david-waterworth closed this issue 2 years ago
I ran into a similar error. Changing find_unused_parameters to True made the error go away.
https://github.com/allenai/allennlp/blob/20df7cdd3eea7f895ceee9c57e2be1a843510748/allennlp/nn/parallel/ddp_accelerator.py#L130-L141
Though I don't know how to configure it.
@huhk-sysu I managed to configure find_unused_parameters as follows:
distributed: {
  cuda_devices: if NUM_GPUS > 1 then std.range(0, NUM_GPUS - 1) else 0,
  ddp_accelerator: {
    type: "torch",
    find_unused_parameters: true
  }
},
By explicitly adding ddp_accelerator you can set the parameters; otherwise, as the code linked above shows, it creates a default with find_unused_parameters=False.
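For intuition on why the default fails, here is a minimal plain-PyTorch sketch (no AllenNLP; names are illustrative) showing that a parameter skipped in forward receives no gradient at all, which is exactly what DDP trips over when find_unused_parameters=False:

```python
import torch
import torch.nn as nn

# A toy model with a branch ("pooler") that is registered but never used
# in forward -- the situation DDP complains about under the default
# find_unused_parameters=False.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.pooler = nn.Linear(4, 4)  # registered, but skipped below

    def forward(self, x):
        return self.encoder(x)  # self.pooler never participates

model = ToyModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# After backward, the unused branch received no gradient at all.
unused = [n for n, p in model.named_parameters() if p.grad is None]
print(unused)  # -> ['pooler.weight', 'pooler.bias']
```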
Thanks, it works for me.
if NUM_GPUS > 1 then std.range(0, NUM_GPUS - 1) else 0
@david-waterworth It may fail when NUM_GPUS == 1. Maybe the following is better:
distributed: if NUM_GPUS > 1 then {
  cuda_devices: std.range(0, NUM_GPUS - 1),
  ddp_accelerator: {
    type: "torch",
    find_unused_parameters: true
  }
}
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
I've pre-trained my own Huggingface RoBERTa transformer and I'm fine-tuning a classifier with it, using the following components.
This works fine if I train on a single GPU; however, it fails when I try to use DDP, with errors about unused parameters not contributing to the loss.
Looking through the code, it seems to be related to the pooler. When you load a RobertaForMaskedLM using AutoModel.from_pretrained, it drops the language model head and adds a pooler layer.
Looking at the PretrainedTransformerEmbedder code, it loads the model from the cache (copy=True) including the pooler; however, in forward the embeddings are extracted before the pooler. In BertPooler, on the other hand, it loads the model from the cache (copy=False) and then deep-copies the pooler. This means there are two poolers, and the first doesn't contribute to gradients, if I'm reading things correctly.
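A minimal sketch of that reading (illustrative names only, not AllenNLP's actual classes): the cached transformer keeps its own pooler that is never called, while a deep copy of it does the actual work, so only the copy receives gradients:

```python
import copy
import torch
import torch.nn as nn

# The cached transformer keeps its own pooler (never called in the
# embedder's forward), while a separate component deep-copies that
# pooler and runs the copy -- as BertPooler is described to do above.
transformer = nn.Module()
transformer.encoder = nn.Linear(4, 4)
transformer.pooler = nn.Linear(4, 4)

pooler_copy = copy.deepcopy(transformer.pooler)

x = torch.randn(2, 4)
embeddings = transformer.encoder(x)  # embedder path: pooler skipped
pooled = pooler_copy(embeddings)     # pooler path: only the copy runs
pooled.sum().backward()

print(transformer.pooler.weight.grad)        # None: original pooler is dead weight
print(pooler_copy.weight.grad is not None)   # True: only the copy trains
```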
Also, if I pass transformer_kwargs: { add_pooling_layer: false } to the embedder, there's no pooler at all and the bert_pooler throws an exception. Is this not the intended way of using the pooler?
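For reference, a small sketch of the add_pooling_layer behaviour, assuming the transformers library is available (a tiny randomly initialised config is used so nothing is downloaded; exact behaviour may vary by version). With the flag set to False the model has no pooler at all, which is why a downstream pooler component would then fail:

```python
from transformers import BertConfig, BertModel

# Tiny random config: hidden_size must be divisible by num_attention_heads.
config = BertConfig(hidden_size=8, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=16,
                    vocab_size=32)

with_pooler = BertModel(config)                              # default
without_pooler = BertModel(config, add_pooling_layer=False)  # no pooler

print(with_pooler.pooler is not None)  # True
print(without_pooler.pooler)           # None
```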
As an aside, oddly, torch recommends setting find_unused_parameters=true, which I assumed was a diagnostic aid, but it actually seems to fix the problem?
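A single-process sketch of what the flag ultimately controls (gloo backend, world_size=1, toy model; names are illustrative): the ddp_accelerator config value is forwarded to torch's DistributedDataParallel constructor, which then tolerates parameters that never produce gradients:

```python
import tempfile
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process process group (gloo, world_size=1) so the sketch runs on
# CPU without multiple GPUs; the temp file is just a rendezvous point.
init_file = tempfile.NamedTemporaryFile(delete=False).name
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(4, 4)
        self.pooler = torch.nn.Linear(4, 4)  # registered but never used

    def forward(self, x):
        return self.encoder(x)

# This is the flag the ddp_accelerator config ends up passing through.
ddp = DistributedDataParallel(ToyModel(), find_unused_parameters=True)
loss = ddp(torch.randn(2, 4)).sum()
loss.backward()  # with the default (False), training errors out instead

print(ddp.module.pooler.weight.grad)  # None: detected as unused, skipped
dist.destroy_process_group()
```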