andersbll / neural_artistic_style

Neural Artistic Style in Python
MIT License

Reducing Memory Usage? #21

Open dsrubin opened 8 years ago

dsrubin commented 8 years ago

Hi,

My background is more in distributed systems than neural networks, but I managed to get this running with GPU support enabled and am just wondering if there is a way to reduce memory usage. My graphics card only has 4GB of memory which limits me to processing at most roughly 500x500 pixels.

When running another implementation of this algorithm on the CPU, I am able to use my entire 16 GB of system memory, which allows me to do much larger pictures, but much, MUCH slower. Is there any way to reduce memory overhead via code or configuration changes? Is it possible to process only part of the image at a time, then re-merge the pieces at the end? I suspect doing that would cause some issues along the borders.

Your insight is appreciated, thanks :)

andersbll commented 8 years ago

Hey, subdividing the image is a bit problematic since the image borders overlap quite a lot (after 5 downsamplings, the 3x3 conv regions might span around 96 pixels). Also, you need to ensure that the different subimages converge nicely together, maybe by taking turns optimizing each of them for a single step.
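
For intuition, here is a rough back-of-the-envelope version of that 96-pixel figure (assuming the usual VGG-style layout of 3x3 convolutions with 2x2 pooling between blocks):

    downsamplings = 5
    effective_stride = 2 ** downsamplings   # 32 input pixels per feature-map cell
    span = 3 * effective_stride             # one 3x3 conv window at that depth
    print(span)                             # -> 96 input pixels of border overlap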

You can reduce the memory footprint by performing comparisons in fewer layers. I don't think there is more than a few percent to be squeezed out of code optimizations.
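
As a rough illustration of the "fewer layers" idea, here is a simplified NumPy sketch (the layer names and function shapes are illustrative, not this repo's actual API):

    import numpy as np

    # Every layer dropped from the comparison is a set of feature maps that no
    # longer has to be kept alive for the backward pass.
    style_layers = ['conv1_1', 'conv2_1', 'conv3_1']   # e.g. instead of all five blocks

    def gram(feats):
        # feats: (channels, height*width) array of flattened feature maps
        return np.dot(feats, feats.T)

    def style_loss(features, style_grams):
        # features / style_grams: dicts mapping layer name -> array
        return sum(np.mean((gram(features[l]) - style_grams[l]) ** 2)
                   for l in style_layers)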

alexjc commented 8 years ago

I have a 4GB card too, and have been processing roughly 1440x768 images. There's some style-image memory that does not get freed, however (memory for the style should be independent of its resolution), but I haven't yet isolated the problem.

filmo commented 8 years ago

Hi, is there a minimum GPU memory size required? I'm trying to run it on my MacBook Pro, which has a 1GB GeForce 650M, and am running into out-of-memory issues.

    philglau (master) neural_artistic_style $ python neural_artistic_style.py --subject images/tuebingen.jpg --style images/starry_night.jpg
    libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: ./include/cudarray/common.hpp:76: out of memory
    Abort trap: 6

As a secondary point: it would be neat if deeppy could make use of 'some of the GPU' in circumstances where there's not enough memory to handle the entire model. My understanding is that the convolutions pretty much require a GPU. I'm not sure how feasible it would be to have a model that moves back and forth between the CPU and GPU, but it might be better than the current all-or-nothing approach, where a card without enough memory can't be used at all.

Might try this on an Amazon AWS instance. If you have tried it there before, please let me know if that's a good secondary option. Thank you.

alexjc commented 8 years ago

The GPU needs enough memory to host the neural network weights, but 1GB should be enough for that. Just lower the resolution of your images.
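
If it helps, here is a small Pillow snippet for shrinking the inputs beforehand (the 512-pixel cap is just an example value):

    from PIL import Image

    def shrink(path, out_path, max_side=512):
        # Scale the image down so that its longest side is at most max_side.
        img = Image.open(path)
        scale = float(max_side) / max(img.size)
        if scale < 1.0:
            new_size = (int(img.size[0] * scale), int(img.size[1] * scale))
            img = img.resize(new_size, Image.LANCZOS)
        img.save(out_path)

    shrink('images/tuebingen.jpg', 'images/tuebingen_small.jpg')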

Hybrid CPU/GPU computation is not trivial to implement in a single model; I'm not sure many frameworks do it.

filmo commented 8 years ago

I installed gfxStatus to force the MacBook Pro to run on the integrated graphics card while freeing up the CUDA GPU. That helped with the deeppy examples (CIFAR-10 now runs, whereas before I was also getting the out-of-memory error).

To get neural_artistic_style working, I had to down-res the input and sample images to 50% of their original size. Once I did that, it worked on my 1GB GPU.

Is there a rough way to approximate how much memory a certain image size would take? I'm assuming the input and sample sizes directly affect the model size. For example, a 100x100-pixel image = X megabytes of GPU memory needed, etc.

Related question: does the 'sample' image take the same amount of space in the model as the 'input' image?

I had initially assumed that the problem was with loading the large imagenet-vgg-verydeep-19 mat.

Thanks for the quick response.

andersbll commented 8 years ago

filmo: The memory usage blows up as the images are propagated through the network. The exact formula for the memory footprint isn't simple because it depends on the number of feature channels, the downsampling, and how deep in the network the images are compared.

I don't quite understand your question regarding the sample image vs. the input image, but I think the answer is yes: the two must have the same size. The style image can have any size.
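
For a very rough feel of the numbers, here is a ballpark estimate of the activation memory alone (assumptions: float32, the standard VGG-19 channel layout, and one gradient buffer per activation; real peak usage is higher because of the weights, the style image, the Gram matrices and temporary buffers, so treat this as a floor):

    def vgg19_activation_mb(height, width):
        # (convs per block, channels per block) for VGG-19
        blocks = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]
        elements = 0
        h, w = height, width
        for n_convs, channels in blocks:
            elements += n_convs * channels * h * w   # conv/ReLU outputs in this block
            h, w = h // 2, w // 2                    # 2x2 max pooling
        return elements * 4 * 2 / 1024.0 ** 2        # float32, x2 for gradient buffers

    print(vgg19_activation_mb(500, 500))   # roughly 560 MB for one 500x500 image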

john-science commented 8 years ago

@andersbll For my use-case, both my content images and styles are grayscale. Is there any way for me to speed this process up (or save graphics card RAM) by altering your process to ignore color information?

Thanks for the wonderful learning tool!

andersbll commented 8 years ago

@theJollySin: Unfortunately, I don't think that removing 2 channels in the input layer will save you a lot of memory relative to the total memory required by the method. For example, the 2nd and 3rd layers contain 64 channels each. :)
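
As a quick worked example of why the input channels hardly matter (assuming float32 and a 500x500 input):

    h, w = 500, 500
    input_saving = 2 * h * w * 4 / 1024.0 ** 2   # dropping 2 of the 3 input channels
    conv1_cost   = 64 * h * w * 4 / 1024.0 ** 2  # one 64-channel feature map
    print(input_saving, conv1_cost)              # ~1.9 MB saved vs ~61 MB per layer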

errolgr commented 8 years ago

@andersbll Hey, great piece of work here. Unfortunately, I've run into some memory issues. I have run this project on two AWS instances: on one with a K520 (4GB card) I was able to obtain an 800x800 render, and on another GPU instance with 4 x K520 (4GB each) I was still only able to render 800x800 images. Why am I running into issues when I'm rendering the images with nearly 16GB of video card memory? Does this have to do with virtualization?

    ValueError: an illegal memory access was encountered
    terminate called after throwing an instance of 'std::runtime_error'
      what():  ./include/cudarray/common.hpp:95: an illegal memory access was encountered
    Aborted (core dumped)

andersbll commented 8 years ago

@errolgr: Alas, the method runs on 1 GPU only, meaning that you cannot utilize the memory on the remaining 3 GPUs. I'm not exactly sure what causes the error message you list; I would have expected an out-of-memory error.

FabienLavocat commented 8 years ago

@errolgr @andersbll I get the error 'an illegal memory access was encountered' a lot when the image is "too large" for the code. I am running the code on a GeForce 980 Ti with 6GB of memory, but if I feed it an image that is over 800x800, I get this error from CUDArray. I've also noticed, using the command nvidia-smi, that the memory used by the code never goes over 2GB.

andersbll commented 8 years ago

@FabienLavocat: OK! I don't know what causes that error, but it is probably safe to assume that it is caused by a lack of memory then. Thanks.

Maybe what you are observing with nvidia-smi is less than the peak memory footprint, which only occurs briefly just before the process dies. :)

errolgr commented 8 years ago

@andersbll Is there a way we could implement model parallelism with this method? That way we could have multiple GPUs working on one task.

alexjc commented 8 years ago

It's easier (and probably quicker) to have different GPUs doing different images, rather than multiple GPUs on the same image.
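
A minimal sketch of that setup, assuming a standard CUDA installation where CUDA_VISIBLE_DEVICES pins each process to one card (the image filenames are placeholders):

    import os
    import subprocess

    def launch(gpu_id, subject, output):
        # One process per GPU, each working on its own image.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        return subprocess.Popen(['python', 'neural_artistic_style.py',
                                 '--subject', subject,
                                 '--style', 'images/starry_night.jpg',
                                 '--output', output], env=env)

    procs = [launch(0, 'images/photo_a.jpg', 'out_a.png'),
             launch(1, 'images/photo_b.jpg', 'out_b.png')]
    for p in procs:
        p.wait()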

errolgr commented 8 years ago

@alexjc Yes, I agree with you; however, I want to allocate the memory of multiple GPUs to one image so I can process larger images.

errolgr commented 8 years ago

@alexjc What additional configuration did you make to be able to process 1440x768 images? I'm pushing a 980 Ti with 6GB of memory and am only able to achieve 800x800.

alexjc commented 8 years ago

I've been using Theano since moving on from Anders' library, and it's easier to maximize memory usage there if that's your concern. (See the Lasagne recipe, or the Keras one.)

For this code, I think I had to manually free some buffers once they were no longer used.

filmo commented 8 years ago

I can get 1056x1600 on a 980 Ti pretty consistently.

I use a separate card to drive my monitors. If you're also using the 980 Ti to drive your monitors, you're probably losing 200 to 300 MB of space.

nvidia-smi -lms 500

I use this to monitor the 980 Ti every half second and see how it's being utilized. Generally, what happens is that there are 'peaks' in the process where all of a sudden the usage jumps dramatically for just a second or so. Thus, with my 1056x1600 images I'm typically in the 4800 to 5200 MB range during most of the process, but if I pay very close attention I can see it peaking close to 5800 or 5900 MB. If nvidia-smi happens to sample right when you go out of memory, you might see it report a peak usage number above 6000 MB on the 980 Ti.

FabienLavocat commented 8 years ago

@filmo That's very interesting; I also have a 980 Ti but can't get close to that number. Your subject images are about 1056x1600; what about the style image?

errolgr commented 8 years ago

@filmo Could you share your config? Are you using ADAM? Also, which model are you using? Even if I were to disable my monitor, I would assume I could barely get over 1000px, as I sometimes cap out at 900x900.

@alexjc I gave Lasagne a try and was able to get up to about 1300x900. Mind sharing some insights on Lasagne vs. Keras?

errolgr commented 8 years ago

@FabienLavocat I'm also noticing the issue you were having before. I have a continuous check running on nvidia-smi and noticed that the code never exceeds 2.5GB. I had the same issue with my 4GB card: stuck at 2.5GB with this code. @andersbll, do you know what could be causing this?

andersbll commented 8 years ago

@andersbll Is there a way we could implement model parallelism with this method? That way we could have multiple GPUs working on one task.

Unfortunately, there is not a lot of model parallelism to be exploited, as the forward-backward pass is sequential. One could imagine subdividing the image and processing each part separately, but then you run into some nasty border effects when merging the images again.

andersbll commented 8 years ago

@errolgr:

[..], however, I want to allocate the memory of multiple GPUs to one image so I can process larger images.

If you want to run the code on large images with little GPU memory, you might consider trying to transfer temporary arrays to host memory. I'm pretty sure this can be hacked onto the Convolution class somehow. :)

EDIT: I had forgotten that convolution doesn't need temporary arrays. However, the ReLU layers do, and this should allow you to cache the temporary array in host memory.

Just to clarify: the temporary arrays are those created during the forward pass to be used again in the backward pass.
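
A minimal sketch of that idea with hypothetical names, using NumPy arrays as stand-ins for the GPU arrays (the point is only that the array needed for the backward pass lives in host memory between the two passes):

    import numpy as np

    class HostCachedReLU(object):
        """ReLU that keeps its backward-pass mask in host memory."""

        def fprop(self, x):
            # In the real code, x would be a cudarray on the GPU; here the
            # "transfer to host" is just np.asarray on a plain array.
            self._mask_host = np.asarray(x > 0)     # stash on the host
            return np.maximum(x, 0)

        def bprop(self, grad):
            mask = np.asarray(self._mask_host)      # bring back for the backward pass
            return grad * mask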

andersbll commented 8 years ago

@errolgr:

I have a continuous check running on nvidia-smi and noticed that the code never exceeds 2.5GB. I had the same issue with my 4GB card: stuck at 2.5GB with this code. @andersbll, do you know what could be causing this?

Can you accurately measure peak memory consumption using nvidia-smi? If not, I would assume that the method dies right after having allocated a bunch of memory. This happens in a split second and I assume nvidia-smi doesn't catch it.

filmo commented 8 years ago

@FabienLavocat My source images range in size. One of my larger sources is 1242x1500 and works when applied against a 1600x1056 image. I just reran it, and it ranges between 3200 and 5400 MB with the average around 4200 MB (I'm just eyeballing nvidia-smi -lms 200). There are probably also peaks much closer to 6000 MB, as this is about the limit of photo size I can push. I haven't done an exact optimization, but as I pushed the image size up much more, it would occasionally crap out.

Some images seem to require more memory than others given the same dimensions, but I might just be imagining this. In general, the target image size seems to matter more than the source image size. Again, not sure if this is true or not.

The temp arrays might account for the 'spikiness' that's hard to observe with nvidia-smi. I don't know of a utility that would let me monitor memory usage on the card and dump it to a text file with higher time precision. If there is such a tool, let me know.
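
For what it's worth, nvidia-smi can itself dump readings to a file, something along the lines of

    nvidia-smi --query-gpu=memory.used --format=csv -lms 100 > mem.log

though the sampling is still coarse enough that a very short allocation spike can slip between two samples.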

mirzman commented 8 years ago

Hi, I run:

    ./neural_artistic_style.py --network ../imagenet-vgg-verydeep-19.mat --iterations 201 --subject images/chern_s9.jpg --style images/starry_night.jpg --output o9.png

with a 769x769 image chern_s9.jpg (768x768 works OK),

and it fails:

  File "./neural_artistic_style.py", line 139, in <module>
    run()
  File "./neural_artistic_style.py", line 131, in run
    cost = np.mean(net.update())
  File "/home/amirzoyan/deepart/neural_artistic_style/style_network.py", line 145, in update
    diff = gram_matrix(x_feats[l]) - self.style_grams[l]
  File "/home/amirzoyan/deepart/neural_artistic_style/style_network.py", line 40, in gram_matrix
    gram = ca.dot(feats, feats.T)
  File "/usr/local/lib/python2.7/dist-packages/cudarray-0.1.dev0-py2.7-linux-x86_64.egg/cudarray/linalg.py", line 45, in dot
    out = cudarray.empty(out_shape, dtype=a.dtype)
  File "/usr/local/lib/python2.7/dist-packages/cudarray-0.1.dev0-py2.7-linux-x86_64.egg/cudarray/cudarray.py", line 246, in empty
    return ndarray(shape, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/cudarray-0.1.dev0-py2.7-linux-x86_64.egg/cudarray/cudarray.py", line 36, in __init__
    self._data = ArrayData(self.size, dtype, np_data)
  File "cudarray/wrap/array_data.pyx", line 16, in cudarray.wrap.array_data.ArrayData.__init__ (./cudarray/wrap/array_data.cpp:1465)
  File "cudarray/wrap/cudart.pyx", line 12, in cudarray.wrap.cudart.cudaCheck (./cudarray/wrap/cudart.cpp:816)
ValueError: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  ./include/cudarray/common.hpp:95: an illegal memory access was encountered

There is enough GPU memory: nvidia-smi -lms 1 | grep 12287MiB | awk '{print $9, $10, $11}' | uniq reports "3239MiB / 12287MiB" at its peak.

What's the problem? A model limitation? A CUDA problem?

andersbll commented 8 years ago

Strange, this works fine for me (I reshaped tuebingen.jpg to 769x769). Are you using cuDNN?

mirzman commented 8 years ago

Yes, I set CUDNN_ENABLED=1. I reshaped tuebingen.jpg to 769x769, and it also fails in my case...

neuralisator commented 8 years ago

I encountered the exact same issue as @mirzman after upgrading from a 4GB GTX 970 to a 12GB Titan X. If you use a larger image, the memory does fill up to more than 4GB, but it then fails even though the 12GB limit does not appear to be hit (using the provided nvidia-smi line). In the end, it can only generate images with the same maximum size as on the 4GB card. I have tried both CUDNN_ENABLED=1 and =0. Note that the neural-style algorithm works fine and uses the available GPU memory. Any info I can provide to help fix this problem? Apart from that, awesome work; this is the best implementation I have seen so far.

neuralisator commented 8 years ago

To elaborate a little: I used CUDArray before without setting CUDNN_ENABLED when compiling it, and I used the cuDNN 5 version with the respective flag set. Both die with the reported error message. The error occurs when it's updating the layer "deeppy.feedforward.activation_layers.ReLU". I also want to say that the test with 769x769 pixels may or may not work for two users with the same amount of GPU RAM (4GB), depending on how much the OS uses for the display. The upper size limit is a bit borderline; I've had 800x800px images working. So to really make sure, you'd have to run it on a >4GB card at a higher image resolution to see whether you run into the RAM limit or a software problem.

oscarriddle commented 8 years ago

@neuralisator I encountered this issue as well, using a GTX 1080 with 8GB of RAM and cuDNN 5.1, an environment almost the same as yours. On my very first try I wanted to reproduce the example but failed because of the illegal memory access. Even if I use a very small image like 260x200, the error still pops up, so I'm afraid this is not really an insufficient-memory issue, at least for some of us.

00fq00 commented 7 years ago

@errolgr Hey, I have the same problem as you. I have also run this project on two AWS instances, one with a K520 (4GB card) and another GPU instance with 4 x K520 (4GB). When I use an image of about 800x800px, I get:

    ValueError: an illegal memory access was encountered
    terminate called after throwing an instance of 'std::runtime_error'
      what():  ./include/cudarray/common.hpp:95: an illegal memory access was encountered
    Aborted (core dumped)

Have you solved the problem? Could you tell me how?