choosehappy / QuickAnnotator

An open-source digital pathology based rapid image annotation tool
BSD 3-Clause Clear License

Map location set to device in baseline model load #30

Closed. VolodymyrChapman closed this issue 1 year ago.

VolodymyrChapman commented 1 year ago

I use my QA projects across devices, and a problem I encountered was a deserialization error when the GPU/device used to create the baseline model differed from the device used for further training from that baseline. Specifically, I had trained the baseline on a machine with multiple GPUs and was retraining on a laptop with a single GPU. The error message was identical to the one described here: https://stackoverflow.com/questions/56369030/runtimeerror-attempting-to-deserialize-object-on-a-cuda-device where PyTorch users attempted to retrain models on CPU instead of GPU. In my case, I had used GPU ID 1 on a multi-GPU system for baseline training and was trying to use GPU 0 for retraining on a single-GPU system. All other software (conda env, Ubuntu 20.04, NVIDIA drivers, etc.) was identical between the two systems, and changing the GPU ID in the config.ini file was not sufficient to resolve the error.

From the stderr output, the PyTorch deserialization error was traced to line 169 of this file (train_model.py), i.e. the torch.load() call. As outlined in the Stack Overflow post above, the solution is to assign the device the model should be loaded onto unambiguously, using the map_location argument of torch.load(). By retrieving the device variable one line earlier and explicitly mapping to the desired device with map_location on line 169, the error was resolved with no other change in behaviour.
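
For readers skimming the thread, a minimal sketch of the kind of change described above (illustrative only; the variable names, config handling, and model path are hypothetical, not the exact lines from train_model.py):

```python
import torch

# GPU ID as configured in config.ini (hypothetical value here)
gpu_id = 0

# Resolve the target device explicitly, falling back to CPU if no GPU is present
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")

# map_location forces the checkpoint onto `device`, regardless of which
# GPU (e.g. cuda:1 on the multi-GPU machine) the baseline was saved from
checkpoint = torch.load("baseline_model.pth", map_location=device)
```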

choosehappy commented 1 year ago

Nice catch. In other code we use this approach:

https://github.com/choosehappy/PytorchDigitalPathology/blob/d9925f3795c47112b6ccbffdc780b1e4ddd72145/segmentation_epistroma_unet/make_output_unet_cmd.py#L53

Have you already seen it?

Would this approach or that one be better?
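
(For context, the linked line uses, if memory serves, the lambda form of map_location, which keeps every storage on the device it was first deserialized to, i.e. the CPU; a minimal sketch with a hypothetical model path:)

```python
import torch

# Lambda form: every storage stays where torch first deserializes it (the CPU);
# the model can then be moved to the desired GPU with .to(device) afterwards.
checkpoint = torch.load("baseline_model.pth",
                        map_location=lambda storage, loc: storage)
```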

VolodymyrChapman commented 1 year ago

Hi Andrew! Yeah, I came across this solution a few hours after posting (D'oh!). So, that's a really good question, and one that appears much simpler than it is. According to the documentation (https://pytorch.org/docs/stable/generated/torch.load.html):

torch.load() uses Python’s unpickling facilities but treats storages, which underlie tensors, specially. They are first deserialized on the CPU and are then moved to the device they were saved from. If this fails (e.g. because the run time system doesn’t have certain devices), an exception is raised. However, storages can be dynamically remapped to an alternative set of devices using the map_location argument.

According to that, the two solutions should be equivalent: both first deserialize on the CPU and then move to the GPU (or stay on CPU, if the user requested that). But, to check whether any hidden overheads existed, I ran a small token QA model-loading task, refreshing the kernel between Jupyter cells to eliminate the effects of any JIT compilation etc. (screenshot of the two timing cells attached).
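
A minimal sketch of this kind of comparison (illustrative only: the model path is hypothetical, and the original experiment ran each variant in a freshly restarted kernel rather than back-to-back in one script):

```python
import time
import torch

# Variant 1: lambda map_location (everything stays on CPU at load time)
start = time.perf_counter()
torch.load("baseline_model.pth", map_location=lambda storage, loc: storage)
print(f"lambda map_location: {time.perf_counter() - start:.3f} s")

# Variant 2: explicit device map_location (moves storages onto the target device)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
start = time.perf_counter()
torch.load("baseline_model.pth", map_location=device)
print(f"device map_location: {time.perf_counter() - start:.3f} s")
```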

Surprisingly, the solution you proposed (top cell) was ~7 times faster! I'll modify my proposed pull request accordingly; thank you for the tip! Best wishes, V

choosehappy commented 1 year ago

Wow, 7 times! Thanks for running the quick experiment.

A quick tip: you can use Jupyter's magic commands to do the timing for you in a less code-intensive way :)

e.g., %%time, %%timeit

https://ipython.org/ipython-doc/3/interactive/magics.html
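
A minimal example of the cell magic, assuming torch was imported in an earlier cell and using a hypothetical model path:

```python
%%timeit
# %%timeit reruns the whole cell body multiple times and reports mean/std timing
torch.load("baseline_model.pth", map_location=lambda storage, loc: storage)
```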

VolodymyrChapman commented 1 year ago

Neat - ta muchly!