Minor updates to multi-gpu training

constantinpape / torch-em

Deep-learning based semantic and instance segmentation for 3D Electron Microscopy and other bioimage analysis problems based on pytorch.

MIT License


Minor updates to multi-gpu training #294

Closed anwai98 closed 1 month ago

anwai98 commented 1 month ago

Hi @constantinpape,

I made some minor changes to the DDP-based training to fit our SAM finetuning. ~There are a few issues in handling mixed_precision, would be good to take a look at it.~

anwai98 commented 1 month ago

~For reference, this is the error I get: `AssertionError: No inf checks were recorded for this optimizer.`~

Edit: The issue above is fixed now (the model expects the params from the DDP-wrapped model). Next, I ran into some synchronization issues. I found out that DDP has a parameter, `find_unused_parameters`, which takes care of these issues dynamically; however, it makes the training very slow (I need to investigate this).
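
For context, here is a minimal sketch of what enabling this option looks like. It is not the torch-em trainer code: `TwoHeadNet` and `run_single_step` are hypothetical names made up for illustration, and the single-process CPU `gloo` group is only an assumption so the snippet runs without GPUs. The optimizer is built from the parameters of the DDP-wrapped model, and `unused_head` never contributes to the loss, which is the kind of situation `find_unused_parameters=True` is meant to handle.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHeadNet(nn.Module):
    """Hypothetical toy model: `unused_head` never takes part in the forward pass."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.used_head = nn.Linear(4, 1)
        self.unused_head = nn.Linear(4, 1)

    def forward(self, x):
        return self.used_head(self.backbone(x))


def run_single_step():
    """Run one training step under DDP; return (loss, whether used weights changed)."""
    torch.manual_seed(0)
    # Assumption: one process, CPU, "gloo" backend, file-based rendezvous.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        # Let DDP detect parameters that receive no gradient in this iteration.
        model = DDP(TwoHeadNet(), find_unused_parameters=True)
        # Build the optimizer from the parameters of the wrapped model.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        before = model.module.used_head.weight.detach().clone()
        loss = model(torch.randn(8, 4)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        changed = not torch.equal(before, model.module.used_head.weight.detach())
        return loss.item(), changed
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_single_step())
```

As far as the PyTorch documentation describes it, this flag makes DDP traverse the autograd graph from the forward outputs in every iteration to mark parameters that will not receive gradients, which adds per-step overhead and may account for part of the slowdown.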

anwai98 commented 1 month ago

@constantinpape This is GTG from my side.