cnr-isti-vclab / TagLab

A CNN based image segmentation tool oriented to marine data analysis
https://taglab.isti.cnr.it/
GNU General Public License v3.0

Errors with non-CUDA machine #136

Open eloralopez opened 7 months ago

eloralopez commented 7 months ago

Discussed in https://github.com/cnr-isti-vclab/TagLab/discussions/81

I am having a similar problem to the one described in the post quoted below, even though it appears that MapClassifier.py has been updated to incorporate the fix that the other user described.

When I try to use "Train Your Network", I get this error:

    Traceback (most recent call last):
      File "TagLab.py", line 4139, in trainNewNetwork
        dataset_train_info, train_loss_values, val_loss_values = training.trainingNetwork(images_dir_train, labels_dir_train,
      File "/Users/eln/TagLab/models/training.py", line 297, in trainingNetwork
        state = torch.load("models/deeplab-resnet.pth.tar")
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1040, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1268, in _legacy_load
        result = unpickler.load()
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1205, in persistent_load
        wrap_storage=restore_location(obj, location),
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 391, in default_restore_location
        result = fn(storage, location)
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 266, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 250, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I looked at source/MapClassifier.py since it was mentioned in the previous discussion, and in lines 98-101 it looks like it should use torch.load with "cpu" since torch.cuda.is_available() is False, so this does not appear to be the same problem that the previous user ran into and fixed.

The problem appears to arise in training.py, but I haven't figured out what it is yet. Any assistance would be appreciated!
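The last line of the traceback names the remedy: pass `map_location` to `torch.load`. A minimal sketch of a load that works with or without CUDA might look like the following. The helper names `pick_device_name` and `load_checkpoint` are hypothetical and not part of TagLab, and `torch` is imported inside the function only so that the device-picking logic can be checked on its own.

```python
def pick_device_name(cuda_available: bool) -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    return "cuda" if cuda_available else "cpu"


def load_checkpoint(path: str):
    """Load a checkpoint onto whichever device this machine offers.

    Hypothetical helper for illustration, not TagLab code.
    """
    import torch  # imported lazily; requires PyTorch to be installed

    device = torch.device(pick_device_name(torch.cuda.is_available()))
    # map_location remaps storages that were saved from a GPU onto the
    # chosen device, so a CPU-only machine can read a GPU-saved file.
    return torch.load(path, map_location=device)
```

With such a helper, line 297 of `training.py` would read `state = load_checkpoint("models/deeplab-resnet.pth.tar")` and behave the same on CUDA and non-CUDA machines.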

Originally posted by **andieich** January 26, 2023

Hi, I successfully installed TagLab on a Windows computer, but had some issues. I cannot use the GPU since it is made by Intel. I therefore tried to install the CPU version of torch and torchvision. When I use the `install.py` script, I get this error:

```
append() takes exactly one argument (2 given)
```

This is caused by line 234. I commented out lines 232 - 236 and manually installed both packages with the following code:

```
pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
pip install torchvision --extra-index-url https://download.pytorch.org/whl/cpu
```

Afterwards, I could install TagLab flawlessly. However, when trying to run an auto segmentation, I got this error:

```
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

So I did as proposed and changed in `source/MapClassifier.py` in line 98:

```
classifier.load_state_dict(torch.load(network_name))
```

to

```
classifier.load_state_dict(torch.load(network_name, map_location=torch.device('cpu')))
```

Now, everything works fine. Maybe that's something to consider for the next TagLab version. Thanks for this amazing software!

eloralopez commented 7 months ago

I fixed the issue by editing both `training.py` and `losses.py` so that they run on the CPU. Hard-coded CUDA references appeared in multiple places in both scripts:

In `losses.py`:

Line 28, add:

    USE_CUDA = torch.cuda.is_available()
    if USE_CUDA:
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
        net.to(device)

Lines 40 and 62, change `dist_maps_tensor = dist_maps_tensor.to(device='cuda:0')` to `dist_maps_tensor = dist_maps_tensor.to(device)`

In the `surface_loss` function, add:

    USE_CUDA = torch.cuda.is_available()
    if USE_CUDA:
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

Line 80, change `one_hot = one_hot.to('cuda:0')` to `one_hot = one_hot.to('cpu')`
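Line 80 could also avoid naming a device at all by moving the tensor to wherever the network output already lives. A minimal sketch, in which `to_same_device` and the `probs` name are hypothetical and not taken from TagLab's `losses.py`:

```python
def to_same_device(tensor, reference):
    """Move `tensor` to the device that `reference` is stored on.

    Works with any object exposing `.to(device)` and `.device`,
    which torch.Tensor does.
    """
    return tensor.to(reference.device)


# Inside a loss function this might read, for example:
#     one_hot = to_same_device(one_hot, probs)  # `probs` is a hypothetical name
```

Written this way, the line needs no further edit when the same code later runs on a machine that does have CUDA.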

In `training.py`:

Line 103, add:

    else:
        device = torch.device("cpu")
        net.to(device)
        torch.cpu.synchronize()

Line 297, change `state = torch.load("models/deeplab-resnet.pth.tar")` to `state = torch.load("models/deeplab-resnet.pth.tar", map_location=torch.device("cpu"))`

Line 333, change `class_weights = torch.FloatTensor(weights).cuda()` to `class_weights = torch.FloatTensor(weights).cpu()`

Line 445, remove `torch.cuda.empty_cache()`

Line 479, change `net.load_state_dict(torch.load(network_filename))` to `net.load_state_dict(torch.load(network_filename, map_location=torch.device("cpu")))`
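Since the hard-coded CUDA references were spread over several lines in two files, a small script can help locate any that remain. The sketch below uses only the standard library. The pattern list is an assumption drawn from the edits listed above rather than an exhaustive catalogue, and `find_cuda_references` is a hypothetical name. Hits still need manual review, because a guarded `torch.device("cuda")` is also reported.

```python
import re

# Patterns that tie code to a GPU. Assumed from the edits described in
# this thread; not an exhaustive list.
CUDA_PATTERNS = [
    re.compile(r"\.cuda\("),                            # tensor.cuda()
    re.compile(r"""['"]cuda(:\d+)?['"]"""),             # 'cuda' or 'cuda:0' literals
    re.compile(r"torch\.cuda\.(?!is_available)\w+"),    # torch.cuda.* except the check
    re.compile(r"torch\.load\((?![^)]*map_location)"),  # torch.load without map_location
]


def find_cuda_references(source: str):
    """Return (line_number, stripped_line) pairs matching any pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in CUDA_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

To scan a file, read its text and pass it in, for example `find_cuda_references(open("models/training.py").read())`.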