Multi-GPU support is in the works, but I've run into issues with some PyTorch differences (while replicating the Lua/Torch7 code).
Hello ProGamerGov and thanks for your tool.
Any news about this support? Is it still planned? I would love to be able to use multiple GPUs too.
@LouSparfell I attempted to implement multi-GPU support here: https://github.com/ProGamerGov/neural-style-pt/tree/multi-gpu, but I've run into a bunch of issues. I don't have a readily available computer with multiple GPUs either, so I can't really test things.
You are welcome to submit a pull request if you are able to get it working!
@ProGamerGov I'd like to make an attempt at writing the multi-GPU code. Would you have any free time for a quick chat about the issues you've already run into, so I don't trip over the same things?
@ajhool It's been a while since I was trying to get multi-GPU working, but my main issues were testing with multiple GPUs and dealing with the feval function (as PyTorch doesn't really have an exact equivalent of nn.GPU).
Okay, thanks. I'll check it out and see how things go
@ajhool Alright, let me know how things go!
@ajhool I made progress on multi device support using this guide I found: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
Here's the current multi-gpu/multi-device branch: https://github.com/ProGamerGov/neural-style-pt/tree/multi-gpu
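For anyone following along, the core idea from that tutorial is to cut the network into chunks, pin each chunk to its own device, and hop the activations between devices inside forward(). A minimal sketch of the pattern, assuming a Sequential net (the class name, split point, and device names are illustrative, not the actual branch code):

import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Illustrative model-parallel wrapper for a Sequential network."""
    def __init__(self, net, split_at, devices=('cuda:0', 'cuda:1')):
        super(TwoDeviceModel, self).__init__()
        layers = list(net.children())
        self.chunk1 = nn.Sequential(*layers[:split_at]).to(devices[0])
        self.chunk2 = nn.Sequential(*layers[split_at:]).to(devices[1])
        self.devices = devices

    def forward(self, x):
        x = self.chunk1(x.to(self.devices[0]))
        # Hop the activation over to the second device mid-forward.
        return self.chunk2(x.to(self.devices[1]))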
Unfortunately, I am stuck at this error:
ubuntu@ip-Address:~/neural-style-pt$ python3 neural_style.py -gpu 0,1,2,3
VGG-19 Architecture Detected
Successfully loaded models/vgg19-d01eb7cb.pth
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
Setting up style layer 2: relu1_1
Setting up style layer 7: relu2_1
Setting up style layer 12: relu3_1
Setting up style layer 21: relu4_1
Setting up content layer 23: relu4_2
Setting up style layer 30: relu5_1
4
['cuda:0', 'cuda:1', 'cuda:2', 'cuda:3']
Capturing content targets
Traceback (most recent call last):
  File "neural_style.py", line 466, in <module>
    main()
  File "neural_style.py", line 164, in main
    net(content_image)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 538, in __call__
    for hook in self._forward_pre_hooks.values():
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 591, in __getattr__
    type(self).__name__, name))
AttributeError: 'ModelParallelModel' object has no attribute '_forward_pre_hooks'
ubuntu@ip-Address:~/neural-style-pt$
This is the setup function for multiple devices: https://github.com/ProGamerGov/neural-style-pt/blob/multi-gpu/neural_style.py#L303-L335
This is the class I created for the multi-device model: https://github.com/ProGamerGov/neural-style-pt/blob/multi-gpu/CaffeLoader.py#L110-L124
I got the model spread out across all of the selected devices successfully, but for some reason I can't run the input images through it. Any ideas?
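(A note for anyone who hits the same traceback: _forward_pre_hooks is created inside nn.Module.__init__, and PyTorch's __getattr__ raises exactly this AttributeError when a subclass never calls the parent constructor before use. A minimal sketch of the usual fix, with an illustrative constructor signature:)

import torch.nn as nn

class ModelParallelModel(nn.Module):
    def __init__(self, chunks):  # illustrative signature
        # nn.Module.__init__ sets up _forward_pre_hooks and friends;
        # skipping it produces the AttributeError in the traceback above.
        super(ModelParallelModel, self).__init__()
        self.chunks = nn.ModuleList(chunks)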
Using multiple GPUs is now possible! Though I am not sure how I am going to deal with the feval function in a more elegant way. I am also not sure what effect, if any, the new code will have on neural-style-pt's speed, because it will have redundant .to(device) calls inside the feval function when using only a single device.
It should be possible to use both multiple GPUs and the CPU at the same time, or just a single GPU and the CPU at the same time. But I believe that will require some sort of conversion between tensor types in order to work.
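A rough sketch of the conversion I have in mind, assuming an ordered list of per-chunk devices that can mix 'cuda:N' and 'cpu' entries (names illustrative):

import torch

def forward_across_devices(chunks, devices, x):
    # .to() handles GPU-to-GPU and GPU-to-CPU hops alike, as long as
    # the tensor stays float32 so CPU chunks can consume it directly.
    for chunk, device in zip(chunks, devices):
        x = chunk(x.to(device))
    return x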
Due to this ability to use multiple CPUs and GPUs, I renamed the -multigpu_strategy parameter to -multidevice_strategy. Other than that, it should work exactly the same as in the original neural-style.
@ProGamerGov Nice work adding this support. It's unclear if you're still seeing the input image issue, but I was never able to invest much time into figuring out multigpu pytorch so I'm of no help here. I like the multidevice strategy idea, that's clever.
Following this chain [1] and [2], is it possible that the GPU 0 usage is the CUDA driver allocation made when loading PyTorch? I think a really useful feature would be a background nvidia-smi watcher with a high sample rate that could produce a plot of the memory usage; the native PyTorch utils apparently don't capture it well [2]. There is a spike at the beginning of the program that isn't reflected in the steady-state memory usage, and it can be large enough to crash the program. It might also be useful in the dev/debug phase for determining what is running on each GPU. I was never able to capture that entire memory usage profile (with the spike) in Lua, but maybe PyTorch makes it easier using a library like [3].
It's been a long time since I've used Python, but something like:

import time, threading, nvgpu

# 2D memory array: each column is a GPU and each row is one sample of
# memory-use percentages. For 50 samples on an 8-GPU machine, the
# dimensions should be 50x8.
memorySamples = []
# Sample timestamps (ms), taken every 250 ms or so.
t = []

def updateGpuPlot():
    # Map the GPU info list onto the memory-use percentages.
    memorySample = [gpu['mem_used_percent'] for gpu in nvgpu.gpu_info()]
    memorySamples.append(memorySample)
    t.append(time.time_ns() // 1000000)
    # Re-arm the timer so sampling continues in the background.
    threading.Timer(0.25, updateGpuPlot).start()
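(Calling updateGpuPlot() once would kick off the loop; the Timer then re-arms itself every 250 ms until the process exits. The collected samples could be plotted afterwards with matplotlib or similar.)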
You could also try disabling gpu 0 and seeing what breaks.
[2] https://discuss.pytorch.org/t/memory-cached-and-memory-allocated-does-not-nvidia-smi-result/28420/2
nvgpu just uses nvidia-smi, and I think that I can replicate the behavior more easily with this nvidia-smi command:
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -lms 50 | tee nvidia-smi.log
I got CPU device support working, and I'm not sure if I can reproduce the error while using a GPU and the CPU as devices.
I think the way I used the dtype variable in my code puts the input images on GPU:0, because it's the default GPU. But I move the inputs to their devices afterwards, so I don't think that should matter?
In my experiments that I shared here: https://github.com/ProGamerGov/neural-style-pt/pull/20, GPU:0 went from 5549MiB to 4973MiB when I only had layer 1 on it. Sticking the model on cuda:6 (an unused GPU) before splitting it to the correct GPUs caused usage on GPU:0 to go from 10309MiB to 10351MiB, and when I decreased the total number of layers used by the model, the usage stayed the same. If it were caused by how many GPUs were being used, then one would expect usage to decrease.
I was thinking that maybe the .to(device) function had something to do with it, but I would think that decreasing the number of GPUs would show less usage if that were the case.
@ajhool I found a memory tracking library called pytorch_memlab and used it to track the memory usage line by line in my code:
Here's the short version: https://gist.github.com/ProGamerGov/0ab55d9b23bb409ca116188883f4a1fd
And here's the full line by line memory tracking output: https://gist.github.com/ProGamerGov/de7a8734e05011018d535385de31b034
And here's what tensors exist on cuda:0 (device 1/GPU:0) when the code is just about finished running: https://gist.github.com/ProGamerGov/8a44351c4fdbf1731b2cbd21b1b32d17
I used -gpu 0,c -multidevice_strategy 0 to make sure the code runs on the CPU while still making the GPU available to the code.
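For anyone who wants to reproduce these measurements, a minimal sketch of how pytorch_memlab can be attached (the work() function is illustrative; I'm going from the library's documented @profile decorator and MemReporter):

import torch
from pytorch_memlab import profile, MemReporter

@profile  # reports per-line max/peak CUDA memory for this function
def work():
    x = torch.randn(1024, 1024, device='cuda:0')
    return x @ x  # each line's allocations show up in the report

y = work()
# MemReporter lists every live tensor and the device holding it,
# roughly the kind of listing in the third gist above.
MemReporter().report()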
Looks like the anomalous GPU:0 memory usage comes from here, as the line by line memory usage by default only tells you what GPU:0 is using:
Line #  Max usage  Peak usage  diff max  diff peak  Line Contents
=================================================================
   258     72.65M     102.00M    72.65M    102.00M  optimizer, loopVal = setup_optimizer(img)
   259      1.85G       2.09G     1.78G      1.99G  while num_calls[0] <= loopVal:
   260      1.85G       2.13G     0.00B     40.00M      optimizer.step(feval)
Here are the line by line memory usage results and what tensors exist when I'm using multiple GPUs: https://gist.github.com/ProGamerGov/e383bc19023d72022e5e426f4a0260af
@ajhool I figured it out! It was actually something that I had suspected initially, but I didn't make the connection until I saw the note here:
I was converting my input tensors to CUDA before running them through my model with: tensor.type('torch.cuda.FloatTensor'). When I split my model across multiple GPUs and moved the inputs to the right layers, the optimizer was somehow running both the stuff on the default GPU (GPU:0) and the stuff on the model separated across multiple GPUs.
I'm not sure if this is a bug that I should report to PyTorch?
Here's the fix: https://github.com/ProGamerGov/neural-style-pt/commit/d08b594a46701433bb18d8be544843d7fc549942
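As far as I can tell, the distinction is that tensor.type('torch.cuda.FloatTensor') always materializes the tensor on the current default CUDA device (cuda:0 unless torch.cuda.set_device() was called), while .to(device) targets an explicit device. A minimal sketch of the difference (device names illustrative):

import torch

x = torch.randn(3, 512, 512)

# Lands on the current default CUDA device, i.e. cuda:0 by default --
# consistent with the phantom GPU:0 usage the fix above removes.
a = x.type('torch.cuda.FloatTensor')

# Lands exactly on the named device, e.g. where the first chunk lives.
b = x.to('cuda:1')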
Now all I need to figure out is whether I can turn this section of code:
for mod in content_losses:
    loss += mod.loss.to(backward_device)
for mod in style_losses:
    loss += mod.loss.to(backward_device)
if params.tv_weight > 0:
    for mod in tv_losses:
        loss += mod.loss.to(backward_device)
Back into just:
for mod in content_losses:
    loss += mod.loss
for mod in style_losses:
    loss += mod.loss
if params.tv_weight > 0:
    for mod in tv_losses:
        loss += mod.loss
Edit: Maybe it's okay to leave the code like this? In some previous testing, I didn't notice any change in speed when I was only using a single GPU.
So, apparently the -seed parameter isn't working correctly now? Though I changed nothing with the seed code other than adding torch.cuda.manual_seed_all(params.seed).
Edit: I removed torch.cuda.manual_seed(params.seed), and now it seems to work correctly again.
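For reference, a minimal sketch of all-device seeding as I understand it (the cudnn line is an extra determinism knob, not something from this thread):

import torch

def setup_seed(seed):
    torch.manual_seed(seed)           # seeds the CPU RNG
    torch.cuda.manual_seed_all(seed)  # one call seeds every visible GPU
    torch.backends.cudnn.deterministic = True  # optional: reproducible cuDNN kernels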
I've merged the multi-gpu branch into both the master branch and the pip-master branch!
https://github.com/ProGamerGov/neural-style-pt/commit/ea001fbf723aba497d565da2f5fc64adcc22e6af
If you experience any issues with the new update, let me know!
I'm going to close this issue now, as the update seems to be working well without any issues. If you experience any issues with the multi-device feature, please make a new issue.
So, I was able to achieve an -image_size of 4016 with L-BFGS, the default VGG-19 model, and 8 Tesla K80 GPUs:
Sat Oct  5 19:56:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:17.0 Off |                    0 |
| N/A   71C    P0    69W / 149W |  10253MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:18.0 Off |                    0 |
| N/A   54C    P0    76W / 149W |   9677MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:19.0 Off |                    0 |
| N/A   71C    P0    65W / 149W |  10061MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   58C    P0    74W / 149W |  10197MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   70C    P0   101W / 149W |   9453MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   53C    P0   105W / 149W |   5297MiB / 11441MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   72C    P0    62W / 149W |   7819MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   56C    P0    74W / 149W |   5107MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4851      C   python3                                  10236MiB   |
|    1      4851      C   python3                                   9662MiB   |
|    2      4851      C   python3                                  10046MiB   |
|    3      4851      C   python3                                  10182MiB   |
|    4      4851      C   python3                                   9438MiB   |
|    5      4851      C   python3                                   5284MiB   |
|    6      4851      C   python3                                   7806MiB   |
|    7      4851      C   python3                                   5094MiB   |
+-----------------------------------------------------------------------------+
This was the strategy I used:
-gpu 0,1,2,3,4,5,6,7 -multidevice_strategy 2,4,6,9,15,18,23
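If it works like the original neural-style's -multigpu_strategy, the numbers are the layer indices where the network is cut, giving one chunk per device. A simplified sketch of the idea (not the actual neural-style-pt code):

import torch.nn as nn

def split_net(net, strategy, devices):
    # "2,4,6,9,15,18,23" with 8 devices -> layers [0:2] on device 0,
    # layers [2:4] on device 1, ..., and layers [23:] on device 7.
    cuts = [int(i) for i in strategy.split(',')]
    layers = list(net.children())
    bounds = [0] + cuts + [len(layers)]
    return [nn.Sequential(*layers[a:b]).to(dev)
            for a, b, dev in zip(bounds[:-1], bounds[1:], devices)]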
I wonder how much higher I could go with the Adam optimizer and a less memory demanding model like VGG-16, Channel Pruning, or NIN?
Fantastic work! I managed to get it to run on Win10.
Just curious: does it support the -multigpu argument like the version that runs on Torch7?