Closed: david-waterworth closed this issue 2 years ago
I ran into a similar error. Changing find_unused_parameters to True made the error go away.
https://github.com/allenai/allennlp/blob/20df7cdd3eea7f895ceee9c57e2be1a843510748/allennlp/nn/parallel/ddp_accelerator.py#L130-L141
Though I don't know how to configure it.
@huhk-sysu I managed to configure find_unused_parameters as follows:
distributed: {
  cuda_devices: if NUM_GPUS > 1 then std.range(0, NUM_GPUS - 1) else 0,
  ddp_accelerator: {
    type: "torch",
    find_unused_parameters: true
  }
},
By explicitly adding ddp_accelerator you can set the parameters; otherwise, as the code linked above shows, it creates a default with find_unused_parameters=False.
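For intuition on why the default fails, here is a minimal plain-PyTorch sketch (no AllenNLP; names are illustrative) showing that a parameter skipped in forward receives no gradient at all, which is exactly what DDP trips over when find_unused_parameters=False:

```python
import torch
import torch.nn as nn

# A toy model with a branch ("pooler") that is registered but never used
# in forward -- the situation DDP complains about under the default
# find_unused_parameters=False.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.pooler = nn.Linear(4, 4)  # registered, but skipped below

    def forward(self, x):
        return self.encoder(x)  # self.pooler never participates

model = ToyModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# After backward, the unused branch received no gradient at all.
unused = [n for n, p in model.named_parameters() if p.grad is None]
print(unused)  # -> ['pooler.weight', 'pooler.bias']
```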
Thanks, it works for me.
if NUM_GPUS > 1 then std.range(0, NUM_GPUS - 1) else 0
@david-waterworth It may fail when NUM_GPUS == 1. Maybe the following is better:
distributed: if NUM_GPUS > 1 then {
  cuda_devices: std.range(0, NUM_GPUS - 1),
  ddp_accelerator: {
    type: "torch",
    find_unused_parameters: true
  }
}
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
I've pre-trained my own Huggingface RoBERTa transformer and I'm fine-tuning a classifier with it, using the following components.
This works fine if I train on a single GPU; however, it fails when I try to use DDP, with errors about unused parameters not contributing to the loss.
Looking through the code, it seems to be related to the pooler. When you load a RobertaForMaskedLM using AutoModel.from_pretrained, it drops the language model head and adds a pooler layer.
Looking at the PretrainedTransformerEmbedder code, it loads the model from the cache (copy=True) including the pooler; however, in forward the embeddings are extracted before the pooler. In BertPooler, on the other hand, it loads the model from the cache (copy=False) and then deep-copies the pooler. This means there are two poolers, and the first doesn't contribute to gradients, if I'm reading things correctly.
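A minimal sketch of that reading (illustrative names only, not AllenNLP's actual classes): the cached transformer keeps its own pooler that is never called, while a deep copy of it does the actual work, so only the copy receives gradients:

```python
import copy
import torch
import torch.nn as nn

# The cached transformer keeps its own pooler (never called in the
# embedder's forward), while a separate component deep-copies that
# pooler and runs the copy -- as BertPooler is described to do above.
transformer = nn.Module()
transformer.encoder = nn.Linear(4, 4)
transformer.pooler = nn.Linear(4, 4)

pooler_copy = copy.deepcopy(transformer.pooler)

x = torch.randn(2, 4)
embeddings = transformer.encoder(x)  # embedder path: pooler skipped
pooled = pooler_copy(embeddings)     # pooler path: only the copy runs
pooled.sum().backward()

print(transformer.pooler.weight.grad)        # None: original pooler is dead weight
print(pooler_copy.weight.grad is not None)   # True: only the copy trains
```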
Also, if I pass transformer_kwargs: { add_pooling_layer: false } to the embedder, there's no pooler at all and the bert_pooler throws an exception. Is this not the intended way of using the pooler?
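For reference, a small sketch of the add_pooling_layer behaviour, assuming the transformers library is available (a tiny randomly initialised config is used so nothing is downloaded; exact behaviour may vary by version). With the flag set to False the model has no pooler at all, which is why a downstream pooler component would then fail:

```python
from transformers import BertConfig, BertModel

# Tiny random config: hidden_size must be divisible by num_attention_heads.
config = BertConfig(hidden_size=8, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=16,
                    vocab_size=32)

with_pooler = BertModel(config)                              # default
without_pooler = BertModel(config, add_pooling_layer=False)  # no pooler

print(with_pooler.pooler is not None)  # True
print(without_pooler.pooler)           # None
```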
As an aside, oddly, torch recommends setting find_unused_parameters=true, which I assumed was a diagnostic aid, but it actually seems to fix the problem?
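A single-process sketch of what the flag ultimately controls (gloo backend, world_size=1, toy model; names are illustrative): the ddp_accelerator config value is forwarded to torch's DistributedDataParallel constructor, which then tolerates parameters that never produce gradients:

```python
import tempfile
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process process group (gloo, world_size=1) so the sketch runs on
# CPU without multiple GPUs; the temp file is just a rendezvous point.
init_file = tempfile.NamedTemporaryFile(delete=False).name
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(4, 4)
        self.pooler = torch.nn.Linear(4, 4)  # registered but never used

    def forward(self, x):
        return self.encoder(x)

# This is the flag the ddp_accelerator config ends up passing through.
ddp = DistributedDataParallel(ToyModel(), find_unused_parameters=True)
loss = ddp(torch.randn(2, 4)).sum()
loss.backward()  # with the default (False), training errors out instead

print(ddp.module.pooler.weight.grad)  # None: detected as unused, skipped
dist.destroy_process_group()
```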