Please let me know if you can reproduce the error by simply running the `DeepFeatureExtractor` feature extraction code with `rcParam["torch_compile_mode"] = "default"` on a node with at least 2 devices.
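For context, the failing run looks roughly like the sketch below; the backbone, ioconfig values, slide path and save directory are placeholders rather than my exact setup, and the argument names follow the updated `device`-based API described in this issue:

```python
# Rough reproduction sketch -- backbone, ioconfig values and paths are placeholders.
from tiatoolbox import rcParam
from tiatoolbox.models import DeepFeatureExtractor, IOSegmentorConfig
from tiatoolbox.models.architecture.vanilla import CNNBackbone

# Enable torch.compile inside tiatoolbox.
rcParam["torch_compile_mode"] = "default"

extractor = DeepFeatureExtractor(
    model=CNNBackbone("resnet50"),
    batch_size=32,
    num_loader_workers=4,
)

ioconfig = IOSegmentorConfig(
    input_resolutions=[{"units": "mpp", "resolution": 0.5}],
    output_resolutions=[{"units": "mpp", "resolution": 0.5}],
    patch_input_shape=[224, 224],
    patch_output_shape=[224, 224],
    stride_shape=[224, 224],
)

# On a node with >= 2 visible GPUs the model is replicated with
# nn.DataParallel, which is where the failure shows up.
output = extractor.predict(
    ["sample_wsi.svs"],    # placeholder slide path
    mode="wsi",
    ioconfig=ioconfig,
    save_dir="features/",  # placeholder output directory
    device="cuda",
)
```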
Maybe `nn.DistributedDataParallel` is a better option to use: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
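For reference, this is roughly what the DDP pattern from that page looks like in plain PyTorch (one process per GPU; the `resnet50` backbone and tensor shapes are just placeholders). I am not suggesting this is how tiatoolbox currently dispatches work, only illustrating the alternative:

```python
# Plain-PyTorch sketch of DistributedDataParallel: one process per GPU.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet50


def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each process owns a single GPU, instead of one process scattering
    # batches across devices as nn.DataParallel does.
    model = DDP(resnet50().cuda(rank), device_ids=[rank])

    with torch.no_grad():
        features = model(torch.randn(8, 3, 224, 224, device=f"cuda:{rank}"))
    print(f"rank {rank}: {tuple(features.shape)}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```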
Description
I am computing features on multiple GPUs on the same node with `DeepFeatureExtractor`.
What I Did
Multi-GPU execution was handled by `nn.DataParallel`, which is built into tiatoolbox. I pulled the changes that introduced `torch.compile` and switched from `ON_GPU` to using `device`: I updated the argument in the `DeepFeatureExtractor`'s `predict` method to use `device` instead of `on_gpu`, roughly as in the sketch below.
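With the same extractor and ioconfig as in the sketch above, the only change on my side was the device argument in the `predict` call (paths are placeholders):

```python
# Before pulling the torch.compile changes:
output = extractor.predict(
    ["sample_wsi.svs"],    # placeholder slide path
    mode="wsi",
    ioconfig=ioconfig,
    save_dir="features/",  # placeholder output directory
    on_gpu=True,
)

# After pulling the torch.compile changes:
output = extractor.predict(
    ["sample_wsi.svs"],
    mode="wsi",
    ioconfig=ioconfig,
    save_dir="features/",
    device="cuda",
)
```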
The error traceback is too long to paste in full, but here are some of the errors (from a single run).
What I can gather is that `torch.compile` is not working well with `nn.DataParallel`.
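As a sanity check, the same combination can be sketched outside tiatoolbox. The order in which the engine compiles and wraps the model may differ, so treat this only as an approximation of what happens on a multi-GPU node:

```python
# Standalone sketch: torch.compile combined with nn.DataParallel.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50().cuda()

# Compile the model, then let nn.DataParallel replicate it across GPUs.
compiled = torch.compile(model, mode="default")
parallel = nn.DataParallel(compiled)

with torch.no_grad():
    out = parallel(torch.randn(16, 3, 224, 224, device="cuda"))
print(out.shape)
```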