`torch.compile` issue when computing features on multiple GPUs (`nn.DataParallel`)

TIA Toolbox version: develop branch
Python version: 3.11.8
Operating System: linux

Description

I am computing features on multiple GPUs on the same node using `DeepFeatureExtractor`.

What I Did

Multi-GPU execution was previously handled by the `nn.DataParallel` support built into tiatoolbox. I pulled the changes that introduced `torch.compile` and that replaced `ON_GPU` with `device`.

I updated my call to `DeepFeatureExtractor`'s `predict` method to pass `device` instead of `on_gpu`.

The full traceback is too long to paste here, but these are some of the errors (all from a single run).

  File "/tmp/torchinductor_qun786/vv/cvvkeueuq2m4jcjzub4hcfpkhpogtc5b2xddykdgxvsxcvnpfa2w.py", line 173, in call
    buf2 = extern_kernels.convolution(buf0, buf1, stride=(14, 14), padding=(0, 0), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

...

    raise exception                                                                                                                                            
RuntimeError: Caught RuntimeError in replica 0 on device 0.  

...

RuntimeError: Triton Error [CUDA]: invalid device context

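From these errors, my understanding is that `torch.compile` does not work well with `nn.DataParallel`: the first error shows the convolution weight and its input on different devices (`cuda:0` and `cuda:1`) inside the compiled code.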
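A possible workaround, which I have not confirmed as a fix, would be to skip `torch.compile` whenever the model is going to be replicated with `nn.DataParallel`, and compile only when it runs on a single device. The sketch below illustrates that policy in plain PyTorch. The helper names `should_compile` and `prepare_model` are hypothetical and are not part of tiatoolbox, and the rule "bare `cuda` with more than one visible GPU means `nn.DataParallel`" is an assumption about how the device is chosen.

```python
def should_compile(device: str, num_gpus: int) -> bool:
    """Return True only when the model will run on a single device.

    Assumption: a bare "cuda" device with more than one visible GPU means
    the model will be wrapped in nn.DataParallel.
    """
    multi_gpu = device == "cuda" and num_gpus > 1
    return not multi_gpu


def prepare_model(model, device: str = "cuda"):
    """Hypothetical helper: compile on one device, otherwise use DataParallel."""
    # Imported here so should_compile() stays usable without PyTorch installed.
    import torch
    from torch import nn

    num_gpus = torch.cuda.device_count()
    if should_compile(device, num_gpus):
        # Single device: move the model first, then compile it.
        return torch.compile(model.to(device))
    # Multi-GPU: keep the model in eager mode and let DataParallel replicate it.
    return nn.DataParallel(model).to(device)
```

With this policy, `prepare_model(model, "cuda")` on a multi-GPU node returns an uncompiled `nn.DataParallel` wrapper, while `"cpu"` or an explicit device such as `"cuda:1"` returns a compiled model. The cost is that the multi-GPU path loses any speed-up from compilation.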

TissueImageAnalytics / tiatoolbox

`torch.compile` issue when computing features on multiple GPUs (`nn.DataParallel`) #889
