UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multi-GPU Training with DP or DDP combined with reentrant gradient checkpointing dies at first backward pass #2844

Open · olivierr42 opened 1 month ago

olivierr42 commented 1 month ago

I am trying to train on an 8xA100 instance. If I set trainer_arguments.gradient_checkpointing to True, the training hangs for a while and then dies with a Segmentation fault (core dumped) error. The error does not occur on a single-GPU node, and it does not happen if gradient checkpointing is disabled. To clarify: setting gradient_checkpointing_kwargs to {"use_reentrant": False} works, but I think the default settings (which use the reentrant variant of checkpointing) should work as well.
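For reference, here is a minimal sketch of the configuration I am describing. The output directory is a placeholder; as far as I know, SentenceTransformerTrainingArguments inherits both checkpointing flags from transformers.TrainingArguments:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    gradient_checkpointing=True,
    # Workaround: request the non-reentrant checkpointing variant.
    # The reentrant default is what hangs and then segfaults under DP/DDP.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```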

I am using MultipleNegativesRankingLoss with an appropriate dataset.
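For context, my setup follows the usual pattern, roughly like the sketch below (the model name and the two training pairs are illustrative, not my actual data):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# MultipleNegativesRankingLoss expects (anchor, positive) pairs; the other
# positives in the batch act as in-batch negatives, so batches need more
# than one pair to be useful.
train_dataset = Dataset.from_dict({
    "anchor": ["What is the capital of France?", "Who wrote Hamlet?"],
    "positive": ["Paris is the capital of France.", "Hamlet was written by Shakespeare."],
})

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
```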

I am almost certain that this is not a sentence-transformers issue, but since gradient checkpointing is used by most large-scale sentence embedding training setups, I am seeking some help here.

Thank you!

vaibhavad commented 3 weeks ago

Hi @olivierr42,

I am facing a similar issue; however, I think this is a transformers issue, not a sentence-transformers one. Did you find any fix for this?

olivierr42 commented 3 weeks ago

> Hi @olivierr42,
>
> I am facing a similar issue; however, I think this is a transformers issue, not a sentence-transformers one. Did you find any fix for this?

Just setting gradient_checkpointing_kwargs to {"use_reentrant": False} worked for me.
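For anyone hitting the same thing: as far as I can tell, transformers forwards gradient_checkpointing_kwargs down to torch.utils.checkpoint.checkpoint, so the fix ultimately boils down to this one flag. A minimal PyTorch-level sketch with a toy layer, unrelated to any specific model:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16, requires_grad=True)

# Reentrant checkpointing (the historical default) re-runs the forward pass
# inside a custom autograd Function and is known to interact badly with
# DDP's gradient hooks in some setups. The non-reentrant variant uses
# saved-tensor hooks instead.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
```

The PyTorch documentation also recommends use_reentrant=False going forward, so this workaround seems safe to rely on rather than just a stopgap.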