UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Distributed training of transformer models fails due to unused pooling layer (includes fix) #1454

Open · visionscaper opened this issue 2 years ago

visionscaper commented 2 years ago

Hello,

First of all, thank you for making this library available.

I wrote code to fine-tune sentence transformer models with data-parallel training, distributed over multiple GPUs. The implementation follows the standard PyTorch DistributedDataParallel recipe.
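For reference, this is roughly what the setup looks like (a minimal sketch with illustrative names, not my exact training code):

```python
# Minimal sketch of the standard PyTorch DDP recipe referred to above
# (illustrative only; not the exact training code behind this issue).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from sentence_transformers import SentenceTransformer

dist.init_process_group(backend="nccl")      # one process per GPU, e.g. launched via torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = SentenceTransformer("stsb-roberta-base-v2").to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
# ... training loop: forward through ddp_model, compute the loss, backward, step.
# The error below is raised during the backward pass of a later iteration.
```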

Issue

In this context, I encounter the following error when trying to fine-tune the sentence transformer model based on the pre-trained stsb-roberta-base-v2:

Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 0: 197 198
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient
on this rank as part of this error

Cause of issue

Hugging Face implementations of models such as RoBERTa add a pooling layer by default. This layer is not used by the sentence transformer, so its parameters do not participate in calculating the loss.
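This is easy to verify directly on the Hugging Face side (a quick illustration using the plain roberta-base checkpoint):

```python
from transformers import RobertaModel

# Loaded with default settings, the HF model includes a pooler whose
# parameters never receive a gradient in a sentence-transformers setup.
model = RobertaModel.from_pretrained("roberta-base")
print(model.pooler)  # RobertaPooler(...)

# Loaded without the pooling layer, those unused parameters disappear.
model = RobertaModel.from_pretrained("roberta-base", add_pooling_layer=False)
print(model.pooler)  # None
```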

Fixing the issue

Although a simple way to suppress this error would be to set find_unused_parameters=True on DistributedDataParallel, this can still cause issues. In my case, for instance, it leads to a runtime error stating that views may not be detached in place.
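For completeness, the workaround looks like this (a sketch; model and rank variables as in the earlier snippet):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Tell DDP to tolerate parameters that receive no gradient. This suppresses
# the error above, but in my setup it then fails with a runtime error about
# detaching views in place.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```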

A better, more fundamental solution would be to allow setting add_pooling_layer=False. To make this fix more generic, it would be good to be able to inject a dictionary with any required custom Hugging Face parameters (custom_hf_params).
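With such a fix in place, usage could look roughly like this (a sketch of the proposed API, not the released sentence-transformers interface; the exact keyword in the fork may differ):

```python
from sentence_transformers import SentenceTransformer, models

# `custom_hf_params` is the parameter proposed above; it would be forwarded
# to the underlying Hugging Face model so the unused pooler is never created.
word_embedding = models.Transformer(
    "sentence-transformers/stsb-roberta-base-v2",
    custom_hf_params={"add_pooling_layer": False},
)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])
```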

I have implemented this solution in this fork.

It would be great if this (or similar) fix could be merged in to this repo!

visionscaper commented 2 years ago

I took the liberty of already offering my fix in this pull request. Please let me know what you think.

dshopin commented 2 years ago

> Hello,
>
> First of all, thank you for making this library available.
>
> The implementation follows the standard PyTorch DistributedDataParallel recipe.

Do you mind sharing your code - how did you wrap the sentence-transformers model in DDP and train it?