Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

Saving networks to file from GPU #1273

Open rijobro opened 3 years ago

rijobro commented 3 years ago

Describe the bug

We should be able to save a network to file, read it back in, and continue using it. This should work whether the network is on the CPU or the GPU, yet when using the GPU, some integration tests fail. See attempts and discussion here: https://github.com/Project-MONAI/MONAI/pull/1268.
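A minimal sketch of the intended round trip, with a stand-in torch network and input shape rather than the failing MONAI integration tests:

    import tempfile

    import numpy as np
    import torch
    from torch import nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # stand-in network; the failing integration tests use MONAI networks instead
    net = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.BatchNorm3d(8), nn.ReLU()).to(device).eval()

    with tempfile.TemporaryDirectory() as tmp:
        path = f"{tmp}/net.ts"
        torch.jit.script(net).save(path)                      # save the scripted network to file
        reloaded = torch.jit.load(path, map_location=device)  # read it back in

    x = torch.rand(1, 1, 16, 16, 16, device=device)
    with torch.no_grad():
        np.testing.assert_allclose(net(x).cpu().numpy(), reloaded(x).cpu().numpy(), rtol=1e-5)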

YuanTingHsieh commented 3 years ago

After some attempts, I found that the problem is not in the saving and loading back. Just scripting the module is enough:

    sm = torch.jit.script(network)
    result1 = sm(input)
    result2 = network(input)

result1 and result2 do not match, i.e. np.testing.assert_allclose fails.

I don't know how to solve this and have not found a related issue in PyTorch.
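For reference, a self-contained version of that comparison; the module and input here are placeholders rather than the actual test case:

    import numpy as np
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # placeholder module and input, just to make the comparison runnable
    network = torch.nn.Sequential(torch.nn.Conv3d(1, 4, 3, padding=1), torch.nn.PReLU()).to(device).eval()
    input = torch.rand(1, 1, 16, 16, 16, device=device)

    sm = torch.jit.script(network)
    with torch.no_grad():
        result1 = sm(input)
        result2 = network(input)

    # this is the assertion that fails on the affected configuration
    np.testing.assert_allclose(result1.cpu().numpy(), result2.cpu().numpy(), rtol=1e-5)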

Nic-Ma commented 3 years ago

Hi @YuanTingHsieh ,

May I know which network you are using in your test? And have you added with torch.no_grad() and network.eval()? Here is some sample code for your reference:

    # script on the CPU, serialize to an in-memory buffer, then reload onto the target device
    scripted = torch.jit.script(net.cpu())
    buffer = scripted.save_to_buffer()
    reloaded_net = torch.jit.load(BytesIO(buffer)).to(device)
    net.to(device)

    # compare in eval mode so dropout and batch norm behave deterministically
    if eval_nets:
        net.eval()
        reloaded_net.eval()

    with torch.no_grad():
        set_determinism(seed=0)
        result1 = net(*inputs)
        result2 = reloaded_net(*inputs)

Thanks.

wyli commented 3 years ago

I think we need to narrow down this issue. Could you test locally with a very small network, for example by adapting the basic UNet? https://github.com/Project-MONAI/MONAI/blob/40b041467bdd013c3e1bf62722a3f30fe8085eda/monai/networks/nets/basic_unet.py#L137

start from one conv:

class BasicUNet(nn.Module):
    def __init__(self, dimensions, in_channels, features, act, norm, dropout):
        super().__init__()
        self.conv_0 = TwoConv(dimensions, in_channels, features[0], act, norm, dropout)

    def forward(self, x: torch.Tensor):
        x0 = self.conv_0(x)
        return x0

and then gradually add more modules to this class until we can reproduce the issue.
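A small helper along these lines could be used at each step of that bisection (a sketch only, not existing MONAI test code):

    import numpy as np
    import torch

    def scripted_matches_eager(module: torch.nn.Module, x: torch.Tensor, rtol: float = 1e-5) -> bool:
        """Return True if torch.jit.script(module) reproduces the eager output on x."""
        module.eval()
        scripted = torch.jit.script(module)
        with torch.no_grad():
            expected = module(x)
            actual = scripted(x)
        return bool(np.allclose(actual.cpu().numpy(), expected.cpu().numpy(), rtol=rtol))

Calling it on each intermediate variant of the trimmed-down network above, on the GPU, should show which added block first breaks the equivalence.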

YuanTingHsieh commented 3 years ago

Hi All,

The networks that are failing are "VNet", "SENET", "SEBlockLayer", "DENSENET".

I did use with torch.no_grad() and network.eval().

The code is here: https://github.com/Project-MONAI/MONAI/compare/master...YuanTingHsieh:yuanting_work
The testing results are here: https://github.com/YuanTingHsieh/MONAI/runs/1476285474?check_suite_focus=true

Nic-Ma commented 3 years ago

Hi @YuanTingHsieh ,

Could you please help check this test: https://github.com/Project-MONAI/MONAI/blob/master/tests/test_vnet.py#L75. I think it gets the same results in that test.

Thanks.

YuanTingHsieh commented 3 years ago

I scripted on the CPU and then loaded the model back on the GPU to evaluate.

I've found that only the "PT16+CUDA102" environment has the issue, which is pytorch: torch==1.6.0 torchvision==0.7.0, base: nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04.

The other environments are fine. I guess this combination is not well tested/verified on the GPU.

Nic-Ma commented 3 years ago

Cool, it seems PyTorch 1.6 TorchScript may not support CUDA 10.2 very well. Anyway, let's always move the model to the CPU before converting to TorchScript.

Thanks.
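In code, that suggestion might look like this sketch (the helper name is illustrative, not an existing MONAI utility):

    import torch

    def script_on_cpu(net: torch.nn.Module, device: torch.device) -> torch.jit.ScriptModule:
        """Convert to TorchScript while the weights are on the CPU, then move both copies back."""
        scripted = torch.jit.script(net.cpu())  # convert on the CPU
        net.to(device)                          # restore the original network's device
        return scripted.to(device)              # run the scripted copy wherever it is needed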

YuanTingHsieh commented 3 years ago

Yeah, in the "PT16+CUDA102" environment it happens whether you run torch.jit.script on the CPU or the GPU: if you evaluate on the GPU, the original model's result will differ from the converted model's result.

rijobro commented 3 years ago

Because of these problems, the unit test that saves the network and loads it back in only uses the CPU (see https://github.com/Project-MONAI/MONAI/pull/1268). Now that @YuanTingHsieh has figured out which configuration is problematic, it would be good for all the other configurations to also test on the GPU when it is available.
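One possible shape for such a test, shown only as an illustration rather than the actual MONAI test code:

    import io
    import unittest

    import numpy as np
    import torch

    @unittest.skipUnless(torch.cuda.is_available(), "GPU round-trip test requires CUDA")
    class TestScriptRoundTripGPU(unittest.TestCase):
        def test_script_round_trip(self):
            device = torch.device("cuda")
            net = torch.nn.Conv3d(1, 4, 3, padding=1).eval()
            buffer = torch.jit.script(net.cpu()).save_to_buffer()  # script on the CPU, as above
            reloaded = torch.jit.load(io.BytesIO(buffer)).to(device)
            net.to(device)
            x = torch.rand(1, 1, 8, 8, 8, device=device)
            with torch.no_grad():
                np.testing.assert_allclose(net(x).cpu().numpy(), reloaded(x).cpu().numpy(), rtol=1e-5)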