When the DS launches a remote training run, the DO side reports "TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first."
How to Reproduce
Run the code line by line. Everything works fine until PART 3: Training. (I have a GPU and CUDA.)
Training stops at epoch 1 and makes no further progress.
On the DO side, I can see the error quoted above: "TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first."
Expected Behavior
This is a classic issue in general ML, and I can find the solution for that case. But how should it be handled when using the FL lib, where the training actually happens on the DO side?
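For reference, the usual fix in plain PyTorch looks roughly like the sketch below. It only illustrates the local case: the helper name `to_numpy` and the `loss` tensor are made up for this example, and it does not show how to apply the same conversion through the FL lib on the DO side, which is the open question here.

```python
import numpy as np
import torch


def to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Copy a tensor to host memory and return it as a NumPy array."""
    # detach() drops the autograd graph; cpu() is a no-op if the tensor
    # already lives on the CPU, so this is safe on both device types.
    return tensor.detach().cpu().numpy()


# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for a value produced during training.
loss = torch.tensor([0.25, 0.5], device=device, requires_grad=True)

# Calling loss.numpy() directly would fail here (TypeError on a CUDA tensor).
loss_np = to_numpy(loss)
print(type(loss_np).__name__, loss_np.tolist())
```

With this pattern the conversion happens wherever the tensor lives, so in a remote setup it would presumably have to run on the DO side before the value is turned into a NumPy array.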
System Information
OS: Ubuntu 18.04
Language Version: Python 3.7.10, torch 1.8.1, torchvision 0.9.1