facebookresearch / InferSent

InferSent sentence embeddings

Slowness using a NVIDIA Tesla P100 with the Quora dataset #86

Closed set92 closed 5 years ago

set92 commented 6 years ago

I'm trying to rerun the code from here https://github.com/aswalin/Kaggle/blob/master/Quora.ipynb because I thought maybe my own code was just performing badly. I only changed the way of loading the model; I use:

import torch
from models import InferSent

V = 1
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))
infersent.set_w2v_path('./data/glove.840B.300d.txt')
infersent = infersent.cuda()
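
(The notebook presumably also builds the word-vector vocabulary before encoding; a README-style call along these lines is assumed here, it is not shown in my snippet:)

# Load word vectors into a vocabulary before the first encode() call,
# as in the InferSent README (K = number of most frequent words to keep).
infersent.build_vocab_k_words(K=100000)
# or: infersent.build_vocab(sentences, tokenize=True)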

But when it gets to the .encode() part it starts working, and after ~3 h it ends up saying (via tqdm) that it will take another 90 h to finish.

I already checked that it is using the GPU, so I'm not sure where the problem is. Is it something related to how the model was distributed before (with the .pickle file)?

EDIT: With a bsize of 64 I see occasional spikes of 260 sentences/s, but it is more common to get between 100 and 200 sentences/s, and I suppose the performance will fall later, since that is when the ETA starts to increase.

josauder commented 6 years ago

I also experience slowness - I use models.py to encode sentences from SNLI using a single GeForce GTX 980 on a machine with nothing else running on it.

The code I use for loading the model is identical to the OP's. When running encode with verbose=True and batches of 128, I get:

Speed : 82.8 sentences/s (gpu mode, bsize=True)

This is nowhere near the promised 1000 sentences/s.

I also have the fixed version from this issue; my models.py reads:

    def is_cuda(self):
        # either all weights are on cpu or they are on gpu
        return self.enc_lstm.bias_hh_l0.data.is_cuda
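
For a quick sanity check that the encoder really sits on the GPU before timing anything, something like this works (a minimal sketch, reusing the is_cuda() helper above and the infersent variable from the OP's loading code):

import torch

assert torch.cuda.is_available()   # otherwise .cuda() below will fail
infersent = infersent.cuda()       # move all encoder weights to the GPU
print(infersent.is_cuda())         # should print True; encode() then reports "gpu mode"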

nvidia-smi reads:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 00000000:02:00.0 Off |                  N/A |
| 52%   79C    P2   116W / 210W |   1013MiB /  4042MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8098      C   /home/jonathan/ENV3/bin/python3             1002MiB |
+-----------------------------------------------------------------------------+

aconneau commented 6 years ago

@set92 : I see in the code you pointed out that they're encoding sentences this way:

trn = pd.DataFrame()
for i in range(train.shape[0]):
    print(i)
    diff = infersent.encode([train["question1"][i]], tokenize=True) - infersent.encode([train["question2"][i]], tokenize=True)
    trn  = pd.concat((trn, pd.Series(diff[0])), axis=1)

Are "train["question1"][i]" batches of sentences or just single sentences? Also, how many sentences/s does the demo.ipynb output?

@josauder: Which sentences are you encoding? What does the demo.ipynb output in terms of nb of sentences/s?

set92 commented 6 years ago

@aconneau The dataset is here https://www.kaggle.com/c/quora-question-pairs/data if you want to take a closer look, but basically "train["question1"][i]" is a single sentence and "train["question2"][i]" is another single sentence.

I just ran the demo.ipynb and the output of the encoding step is

Speed : 133.8 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 9815

Also, the demo is giving me a warning

/home/set_tobur/notebooks/notebooks/InferSent/encoder/models.py:222: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead. sentences[stidx:stidx + bsize]), volatile=True)
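
The pattern the warning asks for looks roughly like this (a sketch of the idea, not the exact line in models.py):

import torch

# Pre-0.4 PyTorch: batch = Variable(torch.FloatTensor(x), volatile=True)
# Post-0.4: drop `volatile` and wrap the forward pass in no_grad() instead.
x = [[0.0] * 300 for _ in range(5)]  # dummy 5-word sentence of 300-d vectors (placeholder data)
with torch.no_grad():
    batch = torch.FloatTensor(x)     # gradients are simply not tracked here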

But I think I fixed that in my code.

EDIT: I just saw that the demo was running on the CPU. I forced GPU mode with model = model.cuda() and now I get:

Speed : 4661.0 sentences/s (gpu mode, bsize=128) nb sentences encoded : 9815

So I copied models.py from the repo and used it with my code, but I still get the same output:

Nb words kept : 9/9 (100.0%)
Speed : 248.1 sentences/s (gpu mode, bsize=64)
Nb words kept : 33/33 (100.0%)
Speed : 95.9 sentences/s (gpu mode, bsize=64)
Nb words kept : 20/20 (100.0%)
Speed : 147.3 sentences/s (gpu mode, bsize=64)

So I suppose the problem is that I'm not passing all the sentences to .encode() at once? But I need to go one by one to compute their difference.

EDIT2: If I run infersent.encode(train["question1"], tokenize=True, verbose=True) I get this output:

Nb words kept : 430669/433244 (99.4%)
Speed : 2679.3 sentences/s (gpu mode, bsize=64)

So I suppose that's the problem: I have to encode the whole question1 column at once, although I'm not sure how to do that while still computing the similarity of each pair of questions.

embeddings1 = infersent.encode(tqdm_notebook(train["question1"]), tokenize=True, verbose=True)
embeddings2 = infersent.encode(tqdm_notebook(train["question2"]), tokenize=True, verbose=True)
for i in tqdm_notebook(range(train.shape[0])):
    diff = embeddings1[i] - embeddings2[i]           # 1-D difference vector for pair i
    trn = pd.concat((trn, pd.Series(diff)), axis=1)

I suppose this will work; now I have to deal with the size of the embeddings in memory, but that's another problem. It was also weird that I didn't get the volatile warning in my code but did get it in the demo.ipynb.
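
For reference, a sketch of the batched version, assuming encode() returns (n_sentences, 4096) numpy arrays as in the demo (building the feature frame from the stacked difference matrix also avoids the repeated pd.concat; transpose it if you want the original one-column-per-pair layout):

import pandas as pd

# Encode each question column once, in batches, instead of one sentence at a time.
embeddings1 = infersent.encode(train["question1"].tolist(), bsize=128, tokenize=True, verbose=True)
embeddings2 = infersent.encode(train["question2"].tolist(), bsize=128, tokenize=True, verbose=True)

# Elementwise difference of the two (n_pairs, 4096) arrays: one row per question pair.
diff = embeddings1 - embeddings2
trn = pd.DataFrame(diff)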

aconneau commented 6 years ago

Hi,

Indeed, if you send sentences one by one to an LSTM on CUDA, it won't exploit the power of your GPU; you need to send batches of sentences. When you send many sentences to "encode" (say 10k), it first sorts them and then creates batches of size (say) 128, to minimize the amount of padding and reduce computation time. For your problem, I would just encode all of question1 and all of question2 and then take the difference of embeddings1 and embeddings2 as you mentioned; it will be much (much) faster. Thanks for pointing out the "volatile" issue, I will fix it soon in a commit; it is inherited from the previous PyTorch version.
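
For illustration, the batching works roughly along these lines (a simplified sketch, not the actual models.py code; encode_batch_fn is a stand-in for the real forward pass):

import numpy as np

def encode_in_batches(encode_batch_fn, sentences, bsize=128):
    # Sort by length so each batch holds similarly long sentences,
    # which minimizes padding and wasted LSTM steps.
    lengths = np.array([len(s.split()) for s in sentences])
    order = np.argsort(lengths)
    sorted_sents = [sentences[i] for i in order]

    chunks = [encode_batch_fn(sorted_sents[i:i + bsize])
              for i in range(0, len(sorted_sents), bsize)]
    embeddings = np.vstack(chunks)

    # Undo the sort so rows line up with the original input order again.
    return embeddings[np.argsort(order)]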

Alexis

aconneau commented 5 years ago

Please re-open if needed. Encode sentences in batches, not one by one. Thanks, Alexis