TissueImageAnalytics / tiatoolbox

Computational Pathology Toolbox developed by TIA Centre, University of Warwick.
https://warwick.ac.uk/tia

`torch.compile` issue when computing features on multiple GPUs (`nn.DataParallel`) #889

Open GeorgeBatch opened 1 day ago

GeorgeBatch commented 1 day ago

Description

I am computing features on multiple GPUs on the same node using DeepFeatureExtractor.

What I Did

Multi-GPU execution was previously handled by the nn.DataParallel wrapper built into tiatoolbox. I pulled the changes that introduced torch.compile and replaced ON_GPU with device.

I updated my call to DeepFeatureExtractor's predict method to pass device instead of on_gpu.

The full traceback is too long to paste, but here are some of the errors (all from a single run).

  File "/tmp/torchinductor_qun786/vv/cvvkeueuq2m4jcjzub4hcfpkhpogtc5b2xddykdgxvsxcvnpfa2w.py", line 173, in call
    buf2 = extern_kernels.convolution(buf0, buf1, stride=(14, 14), padding=(0, 0), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

...

    raise exception                                                                                                                                            
RuntimeError: Caught RuntimeError in replica 0 on device 0.  

...

RuntimeError: Triton Error [CUDA]: invalid device context

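From what I can gather, torch.compile does not work well with nn.DataParallel: the first error suggests that a compiled replica running on one GPU is being handed weights that live on another.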

GeorgeBatch commented 1 day ago

Please let me know if you can reproduce the error by running the DeepFeatureExtractor feature extraction code with rcParam["torch_compile_mode"] = "default" on a node with at least two GPUs.
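If it helps to narrow this down, here is a minimal sketch of a reproduction in plain PyTorch, outside tiatoolbox. It rests on assumptions: build_model and run_once are hypothetical names, the tiny convolution is only a stand-in for the real backbone (chosen to match the stride of 14 in the traceback), and compiling before wrapping in nn.DataParallel is my guess at the order used inside tiatoolbox. I have not confirmed that it triggers exactly the same errors.

```python
from typing import Optional

import torch
from torch import nn


def build_model() -> nn.Module:
    """Tiny stand-in for a ViT-style patch embedding (hypothetical, not the tiatoolbox model)."""
    return nn.Sequential(nn.Conv2d(3, 8, kernel_size=14, stride=14), nn.Flatten())


def run_once(compile_first: bool) -> Optional[torch.Tensor]:
    """Run one forward pass through nn.DataParallel, optionally compiling the model first."""
    if torch.cuda.device_count() < 2:
        print("Need at least two CUDA devices to exercise nn.DataParallel; skipping.")
        return None
    model = build_model().to("cuda:0").eval()
    if compile_first:
        # Assumption: the model is compiled before it is wrapped in DataParallel.
        model = torch.compile(model, mode="default")
    model = nn.DataParallel(model)
    batch = torch.rand(8, 3, 224, 224, device="cuda:0")
    with torch.no_grad():
        return model(batch)


if __name__ == "__main__":
    for compile_first in (False, True):
        try:
            out = run_once(compile_first)
            print(f"compile_first={compile_first}: ok, output is {None if out is None else tuple(out.shape)}")
        except RuntimeError as err:
            # Print only the first line of the error to keep the output short.
            print(f"compile_first={compile_first}: failed with {str(err).splitlines()[0]}")
```

If the uncompiled run succeeds and the compiled one fails, that would point at the interaction between the two wrappers rather than at the model itself.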

Maybe nn.parallel.DistributedDataParallel would be a better option; the PyTorch docs recommend it over nn.DataParallel: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
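To illustrate the one-process-per-device pattern, here is a rough sketch, not a proposed patch. The function name extract_features, the toy model, and the file-based rendezvous are all assumptions made for illustration. Each process owns a single device, so a compiled model never sees tensors from another GPU. The sketch falls back to the gloo backend on CPU so it can be run on a machine without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def extract_features(rank: int, world_size: int, init_file: str, use_compile: bool = False) -> torch.Tensor:
    """Run one forward pass in the process identified by `rank` (hypothetical helper)."""
    use_cuda = (
        torch.cuda.is_available()
        and torch.cuda.device_count() >= world_size
        and dist.is_nccl_available()
    )
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method=f"file://{init_file}",  # the file must not exist yet or must be empty
        rank=rank,
        world_size=world_size,
    )
    try:
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
        if use_cuda:
            torch.cuda.set_device(device)
        # Toy stand-in for the feature extractor backbone.
        model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=14, stride=14), nn.Flatten())
        model = model.to(device).eval()
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        if use_compile:
            # Wrap in DDP first, then compile, as the PyTorch DDP notes suggest.
            ddp_model = torch.compile(ddp_model)
        batch = torch.rand(4, 3, 28, 28, device=device)
        with torch.no_grad():
            return ddp_model(batch).cpu()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke run; a real multi-GPU run would launch one process per rank.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    print(tuple(extract_features(rank=0, world_size=1, init_file=init_file).shape))
```

For an actual multi-GPU run, one process per rank would be started with something like torch.multiprocessing.spawn or torchrun, and each rank would process its own shard of the patches. For pure inference the gradient synchronisation in DDP is not needed, so plain per-device processes might work just as well.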

https://github.com/TissueImageAnalytics/tiatoolbox/blob/5f1cecbc81e0e6953a067c159c4ac1da948ba5c9/tiatoolbox/models/models_abc.py#L42-L61