abisee / pointer-generator

Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"

`run_summarization.py` isn't computing on GPU (but allocates memory) #18

Open pltrdy opened 7 years ago

pltrdy commented 7 years ago

Strange fact: I've been trying to train the summarization model with the default parameters and the default dataset.

It turns out that the GPU is detected (2017-06-01 17:57:05.548650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)), and some GPU memory is allocated for the process, but nothing is running on it (load: 0%). It sometimes peaks at around 27% for a very short time.

The GPU is a single GTX 1080 with 8 GB of RAM (4617 MiB allocated by the process).

Any idea?

pedrobalage commented 7 years ago

@pltrdy, try changing the following line https://github.com/abisee/pointer-generator/blob/master/run_summarization.py#L125 to `default_device = tf.device('/gpu:0')`
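
For context, a minimal sketch of that edit (TF 1.x; the surrounding code is paraphrased, not copied from the repo):

```python
import tensorflow as tf  # TF 1.x, as used by this repo

# Before (paraphrased from the linked line): the CPU is the default device.
# default_device = tf.device('/cpu:0')

# Suggested change: make the first GPU the default device instead.
default_device = tf.device('/gpu:0')

# Graph construction later happens under this context, so new ops land on /gpu:0:
with default_device:
    x = tf.constant([1.0, 2.0, 3.0], name='x')
    y = x * 2.0
```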

pltrdy commented 7 years ago

Thx @pedrobalage, I actually removed this block. I don't really understand why the device is hardcoded. Shouldn't it be left at the default (so the user can set/unset CUDA_VISIBLE_DEVICES), or at least be a parameter?
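
For illustration (not code from the repo): with the hard-coded device removed, placement can be controlled entirely from the environment, e.g.:

```python
import os

# Select which physical GPU(s) TensorFlow may see. This must be set before
# TensorFlow initializes its GPU devices (e.g. before the first Session).
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose GPU 0 only
# os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide all GPUs -> run on CPU

import tensorflow as tf  # TF 1.x

# With no hard-coded tf.device(...), TF's default placement prefers the
# visible GPU for ops that have GPU kernels.
with tf.Session() as sess:
    print(sess.run(tf.constant('placement is now controlled by the env var')))
```

Equivalently, from the shell: `CUDA_VISIBLE_DEVICES=0 python run_summarization.py ...`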

pedrobalage commented 7 years ago

@pltrdy, I didn't understand this decision either. This was the only way it worked for me. I also set CUDA_VISIBLE_DEVICES.

pltrdy commented 7 years ago

@pedrobalage thanks for sharing it, I hadn't even noticed this line.

@abisee I forked, switched to Python 3 syntax & removed this line. My Python 3 implementation won't be compatible with the original because of bytes vs. str differences, so I'm not sure a PR would be welcome for that. https://github.com/pltrdy/pointer-generator

ioana-blue commented 7 years ago

Thanks for pointing this out @pltrdy. Removing that line allows the code to run on the GPU and it's going 20x faster. So you pretty much made my life go by 20 times faster 👍

tianjianjiang commented 7 years ago

Another related discussion can be found in #3

pltrdy commented 7 years ago

@ioana-blue In fact I just complained then @pedrobalage helped us ;)

@tianjianjiang thx for referencing it. Interesting to note that it's the same GPU. I can't figure out why this works on a K40 (used by the author) but not on a 1080, since there is enough RAM. Maybe the CUDA/TF versions have some effect here. No idea.

tianjianjiang commented 7 years ago

@pltrdy You're welcome, and I can also confirm that changing it from CPU to GPU can be worse on a K80 and a GT 650M.

tianjianjiang commented 7 years ago

BTW, there's a recent post about this topic: https://datascience.stackexchange.com/questions/19220/choosing-between-cpu-and-gpu-for-training-a-neural-network Also, my GT 650M and K80 experience generally supports not using a slow GPU with little RAM, and being careful with GPUs in virtualized environments (e.g. AWS EC2 and Google Cloud).

abisee commented 7 years ago

Hi everyone,

I'm afraid this will be a rather unsatisfactory answer, but the reason we set the default device to CPU is that our code is derived from the TextSum code, which does the same thing (see this line). Similarly, we have a few

`with tf.device("/gpu:0"):`

lines in model.py that are derived from similar lines in this file.

Personally, I'm not sure what the rationale for doing this is (I think I tried removing these lines and didn't see a difference). But that's the code we ran for our experiments, so we left it like that for reproducibility, in case it matters.

Edit: When we ran on our K40, the GPU-Util was non-zero so the GPU was being used.

ioana-blue commented 7 years ago

I'm new to everything related to this work, but learning. From what I read in the docs and looking at the code, this is my understanding.

https://www.tensorflow.org/tutorials/using_gpu

explains that `with tf.device(...)` tries to force certain computations onto a particular device, instead of the default device chosen by TF (which, as far as I understand, prioritizes GPUs over CPUs). If the specified device cannot be allocated, you get an exception, UNLESS you use allow_soft_placement=True, which our code does: https://github.com/abisee/pointer-generator/blob/master/util.py#L26
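
A small self-contained sketch of that behavior (TF 1.x API; not code from this repo):

```python
import tensorflow as tf  # TF 1.x

# Request a specific device for these ops.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0], name='a')
    b = tf.constant([4.0, 5.0, 6.0], name='b')
    c = a + b

# allow_soft_placement=True lets TF fall back to another device (e.g. the CPU)
# if '/gpu:0' is unavailable, instead of raising an error.
# log_device_placement=True logs where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))  # runs on the GPU if available, otherwise on the CPU
```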

That's why you can see different behavior depending on where you run the code, what's running concurrently, etc.

Hope this explains some of the mystery.

Spandan-Madan commented 7 years ago

I'm not entirely sure whether I was able to make it run on the GPU. @ioana-blue @abisee @pltrdy @pedrobalage @tianjianjiang how long is a training step taking you, approximately? For me every training step takes about 1.5 seconds; is that the speed you're getting on the GPU or the CPU?

pltrdy commented 7 years ago

@Spandan-Madan I'm at around 1.8 sec/step with default parameters on an NVIDIA 1080. The GPU usage might be sub-optimal, since we can see it is not always at 100% load.

eduOS commented 7 years ago

It's better to switch to the GPU, since the default setting makes training extremely slow.

abisee commented 7 years ago

Hello again,

Though I'm sure I tried changing `default_device` several months ago and found no difference, I tried making the change again this week and it did make things faster. So we've now committed that change to the repo.