Open eloralopez opened 7 months ago
I fixed the issue by editing both training.py and losses.py so that Torch runs on the CPU. Changes were needed in multiple places in both scripts:
In losses.py, at line 28, add:
```
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
net.to(device)
```
At lines 40 and 62, change `dist_maps_tensor = dist_maps_tensor.to(device='cuda:0')`
to `dist_maps_tensor = dist_maps_tensor.to(device)`
In the surface_loss function, add:
```
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```
At line 80, change `one_hot = one_hot.to('cuda:0')`
to `one_hot = one_hot.to('cpu')`
In training.py, at line 103, add:
```
else:
    device = torch.device("cpu")
    net.to(device)
    torch.cpu.synchronize()
```
At line 297, change `state = torch.load("models/deeplab-resnet.pth.tar")`
to `state = torch.load("models/deeplab-resnet.pth.tar", map_location=torch.device("cpu"))`
At line 333, change `class_weights = torch.FloatTensor(weights).cuda()`
to `class_weights = torch.FloatTensor(weights).cpu()`
At line 445, remove `torch.cuda.empty_cache()`
At line 479, change `net.load_state_dict(torch.load(network_filename))`
to `net.load_state_dict(torch.load(network_filename, map_location=torch.device("cpu")))`
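For anyone who wants the same scripts to keep working on a GPU machine, the changes listed above could probably be folded into one device-agnostic pattern. The following is only a minimal sketch, not the actual TagLab code: `pick_device`, `load_state`, the `Linear` stand-in network, and the temporary checkpoint file are all hypothetical names made up for illustration.

```python
import os
import tempfile

import torch


def pick_device() -> torch.device:
    """Return the CUDA device when available, otherwise the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_state(path: str, device: torch.device) -> dict:
    """Load a checkpoint, remapping any CUDA-saved storages onto `device`."""
    return torch.load(path, map_location=device)


device = pick_device()

# Hypothetical stand-in for the real network and its checkpoint file.
net = torch.nn.Linear(4, 2)
checkpoint_path = os.path.join(tempfile.mkdtemp(), "demo.pth.tar")
torch.save(net.state_dict(), checkpoint_path)

# Loading with map_location avoids the CUDA deserialization error on CPU-only machines.
net.load_state_dict(load_state(checkpoint_path, device))
net.to(device)

# Tensors follow the selected device instead of a hardcoded 'cuda:0' or .cuda().
class_weights = torch.FloatTensor([0.2, 0.8]).to(device)
one_hot = torch.zeros(3, 2).to(device)

# Only touch the CUDA cache when CUDA is actually in use.
if device.type == "cuda":
    torch.cuda.empty_cache()
```

With this approach, `.to(device)` would replace both the `.cuda()` and the `.cpu()` variants, so the same code path should run on either kind of machine.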
Discussed in https://github.com/cnr-isti-vclab/TagLab/discussions/81
I am having a similar problem to the one described in the post quoted below, even though it appears that MapClassifier.py has been updated to incorporate the fix that the other user described.
When I try to use "Train Your Network", I get this error:
Traceback (most recent call last):
  File "TagLab.py", line 4139, in trainNewNetwork
    dataset_train_info, train_loss_values, val_loss_values = training.trainingNetwork(images_dir_train, labels_dir_train,
  File "/Users/eln/TagLab/models/training.py", line 297, in trainingNetwork
    state = torch.load("models/deeplab-resnet.pth.tar")
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1268, in _legacy_load
    result = unpickler.load()
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1205, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 391, in default_restore_location
    result = fn(storage, location)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 266, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 250, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I looked at source/MapClassifier.py since it was mentioned in the previous discussion, and in lines 98-101 it looks like it should use torch.load with "cpu" since torch.cuda.is_available() is False, so this does not appear to be the same problem that the previous user ran into and fixed.
The problem appears to arise in training.py, but I haven't figured out what it is yet. Any assistance would be appreciated!
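One way to narrow this down might be to scan the scripts for lines that force CUDA. The sketch below uses only the standard library; `find_cuda_lines`, `scan_file`, and the pattern list are hypothetical and certainly not exhaustive, so the hits are candidates to review rather than confirmed bugs.

```python
import re
from pathlib import Path

# Patterns that commonly force CUDA; this list is an assumption, not exhaustive.
CUDA_PATTERNS = [
    re.compile(r"\.cuda\("),                             # tensor.cuda()
    re.compile(r"""['"]cuda:\d+['"]"""),                 # hardcoded 'cuda:0'
    re.compile(r"torch\.cuda\.(?!is_available)\w+"),     # e.g. torch.cuda.empty_cache
    re.compile(r"torch\.load\((?![^)]*map_location)"),   # torch.load without map_location
]


def find_cuda_lines(source: str):
    """Return (line number, stripped line) pairs that match any pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in CUDA_PATTERNS):
            hits.append((number, line.strip()))
    return hits


def scan_file(path):
    """Scan one source file, e.g. scan_file("models/training.py")."""
    return find_cuda_lines(Path(path).read_text(encoding="utf-8"))


# Small demonstration on lines resembling the ones mentioned in this thread.
sample = '''\
state = torch.load("models/deeplab-resnet.pth.tar")
class_weights = torch.FloatTensor(weights).cuda()
if torch.cuda.is_available():
    net.to(device)
torch.cuda.empty_cache()
state = torch.load(path, map_location=torch.device("cpu"))
'''
for number, line in find_cuda_lines(sample):
    print(number, line)
```

In the sample, the guarded `torch.cuda.is_available()` check and the `torch.load` call that already passes `map_location` are not reported; the other three CUDA-specific lines are.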