ProGamerGov / neural-style-pt

PyTorch implementation of neural style transfer algorithm
MIT License
834 stars 178 forks

Multigpu Capability #2

Closed RexCATCAT closed 4 years ago

RexCATCAT commented 5 years ago

Fantastic work! I managed to get it to run on Win10.

Just curious: does it support the -multigpu argument like the version that runs on Torch7?

ProGamerGov commented 5 years ago

Multi-GPU support is in the works, but I've run into issues with some PyTorch differences (while replicating the Lua/Torch7 code).

LouSparfell commented 5 years ago

Hello ProGamerGov and thanks for your tool.

Any news about this support? Is it still planned? I would love to be able to use multiple GPUs too.

ProGamerGov commented 5 years ago

@LouSparfell I attempted to implement multi-GPU support here: https://github.com/ProGamerGov/neural-style-pt/tree/multi-gpu, but I've run into a bunch of issues. I don't have a readily available computer with multiple GPUs either, so I can't really test things.

You are welcome to submit a pull request if you are able to get it working!

ajhool commented 5 years ago

@ProGamerGov I'd like to make an attempt at writing the multigpu code. Would you have any free time for a quick chat about the issues you have already run into, so I don't trip over the same things?

ProGamerGov commented 5 years ago

@ajhool It's been a while since I was trying to get multigpu working, but my main issues were testing with multiple GPUs and dealing with the feval function (as PyTorch doesn't really have an exact equivalent of nn.GPU).

ajhool commented 5 years ago

Okay, thanks. I'll check it out and see how things go

ProGamerGov commented 5 years ago

@ajhool Alright, let me know how things go!

ProGamerGov commented 4 years ago

@ajhool I made progress on multi device support using this guide I found: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

Here's the current multi-gpu/multi-device branch: https://github.com/ProGamerGov/neural-style-pt/tree/multi-gpu

Unfortunately, I am stuck at this error:

ubuntu@ip-Address:~/neural-style-pt$ python3 neural_style.py -gpu 0,1,2,3
VGG-19 Architecture Detected
Successfully loaded models/vgg19-d01eb7cb.pth
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
Setting up style layer 2: relu1_1
Setting up style layer 7: relu2_1
Setting up style layer 12: relu3_1
Setting up style layer 21: relu4_1
Setting up content layer 23: relu4_2
Setting up style layer 30: relu5_1
4
['cuda:0', 'cuda:1', 'cuda:2', 'cuda:3']
Capturing content targets
Traceback (most recent call last):
  File "neural_style.py", line 466, in <module>
    main()
  File "neural_style.py", line 164, in main
    net(content_image)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 538, in __call__
    for hook in self._forward_pre_hooks.values():
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 591, in __getattr__
    type(self).__name__, name))
AttributeError: 'ModelParallelModel' object has no attribute '_forward_pre_hooks'
ubuntu@ip-Address:~/neural-style-pt$

This is the setup function for multiple devices: https://github.com/ProGamerGov/neural-style-pt/blob/multi-gpu/neural_style.py#L303-L335

This is the class I created for the multi-device model: https://github.com/ProGamerGov/neural-style-pt/blob/multi-gpu/CaffeLoader.py#L110-L124

I got the model spread out across all of the selected devices successfully, but for some reason I can't run the input images through it. Any ideas?
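For readers following along, here is a minimal sketch of the layer-splitting pattern from the linked model-parallel tutorial. It is not the code from the multi-gpu branch: the class name SplitSequential and the tiny two-chunk network are hypothetical, and both chunks are placed on the CPU so the example runs anywhere. An AttributeError for `_forward_pre_hooks` means an attribute that nn.Module.__init__ normally creates is missing, so one thing worth checking in a custom wrapper is that its __init__ calls super().__init__(). The thread does not say whether that was the cause here.

```python
import torch
import torch.nn as nn


class SplitSequential(nn.Module):
    """Hypothetical wrapper that runs consecutive chunks of layers on different devices."""

    def __init__(self, chunks, devices):
        # nn.Module.__init__ creates _modules, _forward_pre_hooks, and so on.
        # It must run before any submodule is assigned.
        super().__init__()
        if len(chunks) != len(devices):
            raise ValueError("need exactly one device per chunk")
        self.devices = list(devices)
        self.chunks = nn.ModuleList([c.to(d) for c, d in zip(chunks, devices)])

    def forward(self, x):
        for chunk, device in zip(self.chunks, self.devices):
            # Move the activations to the device that holds the next chunk.
            x = chunk(x.to(device))
        return x


# Stand-in devices; on a multi-GPU machine this could be ["cuda:0", "cuda:1"].
devices = ["cpu", "cpu"]
net = SplitSequential(
    [
        nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(8, 4, 3, padding=1), nn.ReLU()),
    ],
    devices,
)
out = net(torch.rand(1, 3, 16, 16))
print(tuple(out.shape))
```

The output of the last chunk stays on the last device, so any loss computed from it has to be moved to a common device before the losses are summed.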

ProGamerGov commented 4 years ago

Using multiple GPUs is now possible! Though I am not sure how I am going to deal with the feval function in a more elegant way. I am also not sure what effect, if any, the new code will have on neural-style-pt's speed, because it will have redundant .to(device) calls inside the feval function when using only a single device.

It should be possible to use both multiple GPUs and the CPU at the same time, or just a single GPU and the CPU at the same time. But I believe that will require some sort of conversion between tensor types in order to work.

Due to this ability to use multiple CPUs and GPUs, I renamed the -multigpu_strategy parameter to -multidevice_strategy. Other than that, it should work exactly the same as in the original neural-style.

ajhool commented 4 years ago

@ProGamerGov Nice work adding this support. It's unclear if you're still seeing the input image issue, but I was never able to invest much time into figuring out multigpu pytorch so I'm of no help here. I like the multidevice strategy idea, that's clever.

Following this chain [1] and [2], is it possible that the GPU 0 usage is the CUDA driver allocation made when loading PyTorch? I think a really useful feature would be a background nvidia-smi watcher with a high sample rate that could produce a plot of the memory usage, since the native PyTorch utilities apparently don't capture this well [2]. There is a spike at the beginning of the program that isn't captured by the steady-state memory usage, and it can be large enough to crash the program. A watcher like this might also be useful in the dev/debug phase to determine what is running on each GPU. I was never able to capture that entire memory usage profile (with the spike) in Lua, but maybe PyTorch makes it easier using a library like [3].

It's been a long time since I've used Python, but something like:

import threading, time
import nvgpu

# 2D list of memory samples: each row is one sample and each column is a GPU.
# For 50 samples on an 8-GPU machine, the dimensions should be 50x8.
memorySamples = []

# Sample times in milliseconds, one every 250 ms or so.
t = []

def updateGpuPlot():
    # nvgpu.gpu_info() returns one dict per GPU; keep only the memory use percentages.
    memorySample = [gpu['mem_used_percent'] for gpu in nvgpu.gpu_info()]
    memorySamples.append(memorySample)
    t.append(time.time_ns() // 1000000)

    # Schedule the next sample in 250 ms. The timer is a daemon thread
    # so that it doesn't keep the program alive after the main thread exits.
    timer = threading.Timer(0.25, updateGpuPlot)
    timer.daemon = True
    timer.start()

# Take the first sample; the function reschedules itself after that.
updateGpuPlot()

You could also try disabling gpu 0 and seeing what breaks.

[1] https://discuss.pytorch.org/t/attributeerror-modelparallelmodel-object-has-no-attribute-forward-pre-hooks/56463/4

[2] https://discuss.pytorch.org/t/memory-cached-and-memory-allocated-does-not-nvidia-smi-result/28420/2

[3] https://pypi.org/project/nvgpu/

ProGamerGov commented 4 years ago

nvgpu just uses nvidia-smi, and I think that I can replicate the behavior more easily with this nvidia-smi command:

 nvidia-smi --query-gpu=timestamp,memory.used --format=csv -lms 50 | tee nvidia-smi.log
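A log produced this way can be scanned for the start-up spike mentioned earlier. The sketch below is one possible way to do that, assuming the usual CSV layout of a header row followed by rows such as `2019/10/05 19:56:38.100, 72 MiB`. The function name peak_memory_mib and the sample text are made up for illustration. Because the command does not query the GPU index, rows from different GPUs cannot be told apart, so this reports a single overall peak.

```python
import csv
import io


def peak_memory_mib(log_text):
    """Return the largest memory.used value (in MiB) found in the CSV log text."""
    reader = csv.reader(io.StringIO(log_text), skipinitialspace=True)
    header = next(reader)
    # Find the memory column by name instead of assuming its position.
    column = next(i for i, name in enumerate(header) if name.startswith("memory.used"))
    peak = 0
    for row in reader:
        if len(row) <= column:
            continue  # skip blank or truncated lines
        parts = row[column].split()  # e.g. ["2140", "MiB"]
        if parts and parts[0].isdigit():
            peak = max(peak, int(parts[0]))
    return peak


# Made-up sample in the assumed format; a real run would read nvidia-smi.log instead.
sample = """timestamp, memory.used [MiB]
2019/10/05 19:56:38.100, 72 MiB
2019/10/05 19:56:38.150, 2140 MiB
2019/10/05 19:56:38.200, 1894 MiB
"""
print(peak_memory_mib(sample))
```

Adding `index` to the --query-gpu list would make it possible to track the peak per GPU with the same approach.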

I got CPU device support working, and I'm not sure if I can reproduce the error while using a GPU and CPU for devices.

I think the way I used the dtype variable in my code puts the input images on GPU:0, because it's the default GPU. But I move the inputs to their device afterwards, so I don't think that should matter?


In my experiments that I shared here: https://github.com/ProGamerGov/neural-style-pt/pull/20, GPU:0 went from 5549MiB to 4973MiB when I only had layer 1 on it. Sticking the model on cuda:6 (unused GPU) before splitting it to the correct GPUs caused usage on GPU:0 to go from 10309MiB to 10351MiB, and then when I decreased the total number of layers used by the model the usage stayed the same. If it were caused by how many GPUs were being used, then one would expect usage to decrease.

I was thinking that maybe the .to(device) function had something to do with it, but I would think that decreasing the number of GPUs would show less usage, if this were the case.

ProGamerGov commented 4 years ago

@ajhool I found a memory tracking library called pytorch_memlab and used it to track the memory usage line by line in my code:

Here's the short version: https://gist.github.com/ProGamerGov/0ab55d9b23bb409ca116188883f4a1fd

And here's the full line by line memory tracking output: https://gist.github.com/ProGamerGov/de7a8734e05011018d535385de31b034

And here's what tensors exist on cuda:0 (device 1/GPU:0) when the code is just about finished running: https://gist.github.com/ProGamerGov/8a44351c4fdbf1731b2cbd21b1b32d17

I used -gpu 0,c -multidevice_strategy 0 to make sure the code runs on the CPU while still making the GPU available to the code.

ProGamerGov commented 4 years ago

Looks like the anomalous GPU:0 memory usage comes from here, as the line by line memory usage by default only tells you what GPU:0 is using:

Line # Max usage   Peak usage diff max diff peak  Line Contents
===============================================================
   258    72.65M      102.00M   72.65M  102.00M      optimizer, loopVal = setup_optimizer(img)
   259     1.85G        2.09G    1.78G    1.99G      while num_calls[0] <= loopVal:
   260     1.85G        2.13G    0.00B   40.00M           optimizer.step(feval)

Here are the line by line memory usage results and what tensors exist when I'm using multiple GPUs: https://gist.github.com/ProGamerGov/e383bc19023d72022e5e426f4a0260af

ProGamerGov commented 4 years ago

@ajhool I figured it out! It was actually something that I had suspected initially, but I didn't make the connection until I saw the note here:

If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.

I was converting my input tensors to cuda before running them through my model with: tensor.type('torch.cuda.FloatTensor'). When I split my model across multiple GPUs and converted the inputs to the right layers, the optimizer was somehow running both the stuff on the default GPU (GPU:0) and the stuff on the model separated across multiple GPUs.

I'm not sure if this is a bug that I should report to PyTorch?

Here's the fix: https://github.com/ProGamerGov/neural-style-pt/commit/d08b594a46701433bb18d8be544843d7fc549942
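The linked commit is the authoritative fix. As a rough sketch of the ordering that the quoted note recommends (this is not the commit's code, and the helper name make_optimized_image is hypothetical), the tensor is first placed on an explicit device, then marked as requiring gradients, and only then handed to the optimizer.

```python
import torch


def make_optimized_image(content, device):
    """Hypothetical helper: place the image on its device before building the optimizer."""
    # Move first, then mark as a leaf tensor that requires gradients.
    img = content.clone().to(device).requires_grad_(True)
    # The optimizer keeps a reference to this exact tensor object.
    optimizer = torch.optim.LBFGS([img], max_iter=1)
    return img, optimizer


# Use an explicit device instead of a type string that implies the default GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
img, optimizer = make_optimized_image(torch.rand(1, 3, 8, 8), device)
print(img.device, img.is_leaf)
```

If the tensor were moved or converted after the optimizer was built, the optimizer would still hold the old object, which is the situation the note warns about.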

ProGamerGov commented 4 years ago

Now all I need to figure out is whether I can turn this section of code:

        for mod in content_losses:
            loss += mod.loss.to(backward_device)
        for mod in style_losses:
            loss += mod.loss.to(backward_device)
        if params.tv_weight > 0:
            for mod in tv_losses:
                loss += mod.loss.to(backward_device) 

Back into just:

        for mod in content_losses:
            loss += mod.loss
        for mod in style_losses:
            loss += mod.loss
        if params.tv_weight > 0:
            for mod in tv_losses:
                loss += mod.loss

Edit: Maybe it's okay to leave the code like this? In some previous testing, I didn't notice any increase in terms of speed when I was only using a single GPU.
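One relevant detail from the PyTorch documentation for Tensor.to: when a tensor already has the requested dtype and device, the call returns the tensor itself instead of making a copy. That may be why leaving the .to(backward_device) calls in place is harmless on a single device. A small check, using the CPU as a stand-in for backward_device and plain tensors as stand-ins for the mod.loss values:

```python
import torch

# Stand-in for the device on which the losses are summed, e.g. "cuda:0" in a multi-GPU run.
backward_device = torch.device("cpu")

# Stand-ins for the per-layer mod.loss values.
losses = [torch.tensor(1.5), torch.tensor(2.0)]

# Already on the target device with the same dtype, so .to() returns the same object.
same = losses[0].to(backward_device)
print(same is losses[0])

loss = 0
for layer_loss in losses:
    loss = loss + layer_loss.to(backward_device)
print(float(loss))
```

When the losses live on different devices, the same loop copies each one to backward_device before adding it, which is what the multi-device case needs.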

ProGamerGov commented 4 years ago

So, apparently the -seed parameter isn't working correctly now?

Though I changed nothing with the seed code, other than adding torch.cuda.manual_seed_all(params.seed).

Edit: I removed torch.cuda.manual_seed(params.seed), and now it seems to work correctly again.

ProGamerGov commented 4 years ago

I've merged the multi-gpu branch into both the master branch and the pip-master branch!

https://github.com/ProGamerGov/neural-style-pt/commit/ea001fbf723aba497d565da2f5fc64adcc22e6af

If you experience any issues with the new update, let me know!

ProGamerGov commented 4 years ago

I'm going to close this issue now, as the update seems to be working well without any issues. If you experience any issues with the multi-device feature, please make a new issue.

ProGamerGov commented 4 years ago

So, I was able to achieve an -image_size of 4016 with L-BFGS, the default VGG-19 model, and 8 Tesla K80 GPUs:

Sat Oct  5 19:56:38 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:17.0 Off |                    0 |
| N/A   71C    P0    69W / 149W |  10253MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:18.0 Off |                    0 |
| N/A   54C    P0    76W / 149W |   9677MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:19.0 Off |                    0 |
| N/A   71C    P0    65W / 149W |  10061MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   58C    P0    74W / 149W |  10197MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   70C    P0   101W / 149W |   9453MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   53C    P0   105W / 149W |   5297MiB / 11441MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   72C    P0    62W / 149W |   7819MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   56C    P0    74W / 149W |   5107MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4851      C   python3                                    10236MiB |
|    1      4851      C   python3                                     9662MiB |
|    2      4851      C   python3                                    10046MiB |
|    3      4851      C   python3                                    10182MiB |
|    4      4851      C   python3                                     9438MiB |
|    5      4851      C   python3                                     5284MiB |
|    6      4851      C   python3                                     7806MiB |
|    7      4851      C   python3                                     5094MiB |
+-----------------------------------------------------------------------------+

This was the strategy I used:

-gpu 0,1,2,3,4,5,6,7 -multidevice_strategy 2,4,6,9,15,18,23
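To make the two flags easier to read, the sketch below shows how such a strategy could map layers to devices. It assumes the split semantics documented for -multigpu_strategy in the original neural-style README, where with four devices a strategy of 3,6,12 puts layers 1-2 on the first device, 3-5 on the second, 6-11 on the third, and the rest on the fourth. Whether neural-style-pt counts layers in exactly the same way is an assumption, and the function names parse_devices and assign_layers are hypothetical.

```python
def parse_devices(gpu_arg):
    """Map a -gpu style string such as '0,1,c' to ['cuda:0', 'cuda:1', 'cpu']."""
    names = [d.strip().lower() for d in gpu_arg.split(",")]
    return ["cpu" if d == "c" else "cuda:" + d for d in names]


def assign_layers(num_layers, devices, strategy_arg):
    """Return a list whose entry i is the device for layer i + 1 (layers are 1-indexed)."""
    splits = [int(s) for s in strategy_arg.split(",")]
    if len(splits) != len(devices) - 1:
        raise ValueError("need exactly one split point fewer than devices")
    if splits != sorted(splits):
        raise ValueError("split points must be in increasing order")
    assignment = []
    device_index = 0
    for layer in range(1, num_layers + 1):
        # Advance to the next device each time a split point is reached.
        while device_index < len(splits) and layer >= splits[device_index]:
            device_index += 1
        assignment.append(devices[device_index])
    return assignment


devices = parse_devices("0,1,2,3")
for layer, device in enumerate(assign_layers(14, devices, "3,6,12"), start=1):
    print(layer, device)
```

Under the same assumption, -gpu 0,c -multidevice_strategy 0 assigns every layer to the CPU, which matches how that combination is described earlier in the thread.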

I wonder how much higher I could go with the Adam optimizer and a less memory demanding model like VGG-16, Channel Pruning, or NIN?