Hi: You'll need to set up GPU support for TensorFlow 2.
To verify the GPU is being used, you can run nvidia-smi at the command line during training.
The GPU will be used automatically; just make sure you have this at the top of your script/notebook:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Also, DistilBERT trains in roughly half the time of BERT with nearly the same performance.
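If you want to try DistilBERT, it is available through ktrain's Transformer API. Here is a minimal sketch (the checkpoint name and parameters are illustrative, and it assumes x_train/y_train and x_test/y_test are lists of raw texts and labels):
import ktrain
from ktrain import text

# wrap a Hugging Face checkpoint in a ktrain preprocessor/model pair
t = text.Transformer('distilbert-base-uncased', maxlen=500,
                     class_names=['pos', 'neg'])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(2e-5, 3)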
Hi Amaiya, thanks for your help, and apologies for the miscommunication -- I already have CUDA 10.1 and NVIDIA driver 435.21 set up, am reasonably comfortable with GPU computing, and can, for example, run the examples in the Hugging Face transformers library on the GPU without issue. Installing ktrain from requirements.txt also pulls in the TensorFlow 2 dependency; here is the relevant subset of packages from 'conda list':
tensorboard 2.1.1 pypi_0 pypi
tensorflow 2.1.0 pypi_0 pypi
tensorflow-datasets 3.2.0 pypi_0 pypi
tensorflow-estimator 2.1.0 pypi_0 pypi
tensorflow-metadata 0.22.2 pypi_0 pypi
I already have the environment variables set at the top of the file, but unfortunately this doesn't seem to have any effect; the code still runs on the CPU instead of the GPU:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import ktrain
from ktrain import text as txt

# load data
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_folder('data/aclImdb', maxlen=500,
                                                                      preprocess_mode='bert',
                                                                      train_test_names=['train', 'test'],
                                                                      classes=['pos', 'neg'])

# load model
model = txt.text_classifier('bert', (x_train, y_train), preproc=preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)

# find good learning rate
learner.lr_find()  # briefly simulate training to find good learning rate
learner.lr_plot()  # visually identify best learning rate

# train using 1cycle learning rate schedule for 3 epochs
learner.fit_onecycle(2e-5, 3)
Here is the output:
(act2) peter@neutronium:~/github/act2$ python test2.py
detected encoding: utf-8
preprocessing train...
language: en
done. 1/1 :
Is Multi-Label? False
preprocessing test...
language: en
done. 1/1 :
Is Multi-Label? False
maxlen is 500
done.
simulating training for different learning rates... this may take a few moments...
Train on 25000 samples
Epoch 1/1024
24/25000 [..............................] - ETA: 12:26:04 - loss: 0.9763 - accuracy: 0.4167
Here is the nvidia-smi output at this time:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX Off | 00000000:0D:00.0 On | N/A |
| 41% 49C P2 67W / 280W | 1341MiB / 24215MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1381 C python 161MiB |
| 0 1432 G /usr/lib/xorg/Xorg 372MiB |
| 0 1645 G /usr/lib/vmware/bin/vmware-vmx 53MiB |
| 0 3402 G /usr/bin/krunner 25MiB |
| 0 3404 G /usr/bin/plasmashell 102MiB |
| 0 11049 G /usr/bin/obs 133MiB |
| 0 16934 G /usr/bin/vlc 8MiB |
| 0 17841 G ...uest-channel-token=13999864370167093987 63MiB |
| 0 31521 G ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files 415MiB |
+-----------------------------------------------------------------------------+
Here is the output of top, showing that the process is running on the CPUs instead:
top - 15:32:42 up 2 days, 5:36, 10 users, load average: 20.13, 9.27, 3.69
Tasks: 565 total, 1 running, 392 sleeping, 0 stopped, 1 zombie
%Cpu(s): 70.9 us, 8.5 sy, 0.0 ni, 20.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13185091+total, 77732896 free, 17963680 used, 36154336 buff/cache
KiB Swap: 2097148 total, 2091260 free, 5888 used. 11183318+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1381 peter 20 0 26.852g 0.010t 347496 S 2478 8.0 50:59.66 python
Some other quick tests:
1) Setting the environment variables manually in the shell doesn't help either:
peter@neutronium:~/github/act2$ printenv | grep CUDA
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=0
2) The above issues are on Python 3.7. The same issue occurs with a fresh Conda Python 3.8 install containing only the bare essentials (ktrain and its immediate dependencies), which uses tensorflow 2.2.0.
Thanks for the extra information. I do see that this is definitely using the CPU. When you ran the transformers examples, are you sure you were running the TensorFlow examples and not the PyTorch ones? I've verified that everything is working correctly on a local GPU as well as on Google Colab, so it still seems like a TF/CUDA issue to me.
One thing you can try is to re-run your ktrain BERT example but add this to the top of your script:
os.environ["SUPPRESS_TF_WARNINGS"]="0"
ktrain suppresses a lot of TensorFlow warnings by default; this will allow you to see them. Are there any warnings about CUDA?
Also, when you run the MNIST example below, does nvidia-smi show that it is using the GPU? Are there any errors or warnings related to CUDA or the GPU?
from __future__ import print_function
from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12
img_rows, img_cols = 28, 28

# load MNIST and reshape to the backend's expected channel layout
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# scale pixel values to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# one-hot encode the labels
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# simple CNN classifier
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
print(model.evaluate(x_test, y_test, verbose=0))
See also this Medium article: TensorFlow 2.1 doesn't recognize my GPU, though Cuda 10.1 (with Solution)
I added some code to list the available GPUs, and TensorFlow wasn't able to see them -- so I reinstalled the drivers, and it appears to be working now. This was of course entirely on my end, and not an issue with ktrain. Thanks again for your help!
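For anyone hitting the same problem, the check was along these lines (a sketch, not necessarily the exact code I ran):
import tensorflow as tf

# an empty list here means TensorFlow cannot see any GPU and will
# silently fall back to the CPU
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))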
Hi there,
I'm new to ktrain, and the example models I'm running appear to run (slowly) on the CPU instead of the GPU. I'm using Ubuntu 18.04 and a Titan RTX with NVIDIA driver version 435.21.
1) I've tried two BERT demos: the huggingface demo and the aclImdb demo. Both seem to have this issue.
2) Some Googling suggested adding these lines to the top of the files, but it doesn't seem to have had an effect:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
3) Here is my requirements.txt file (hacked from another example):
torch==1.4
torchtext==0.5
transformers==2.11.0
spacy==2.2.4
matplotlib
gensim
sklearn
scikit-learn==0.21.3
scipy==1.4.1
ktrain
thanks, Peter