MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License
3.32k stars 337 forks

How exactly to use GPU with KeyBERT? #108

Open TejasAdsul opened 2 years ago

TejasAdsul commented 2 years ago

I'm trying to extract keywords and keyphrases from around 20k abstracts of journal articles. The FAQ mentions that it is recommended to use a GPU with KeyBERT. However, I'm unclear how exactly to run the extract_keywords function on a GPU. I tried model = KeyBERT() followed by model.to(device), but it says KeyBERT has no attribute 'to'. I'd appreciate some help with running KeyBERT on a GPU. Thanks!

MaartenGr commented 2 years ago

By default, KeyBERT uses a model from SentenceTransformers. That package automatically selects a GPU if it can find one, so there is no need to select one yourself. Just use KeyBERT as is and it will run on the GPU if you have one available.
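
A minimal sketch (the model name is just an example) to confirm which device the underlying SentenceTransformer ends up on:

from sentence_transformers import SentenceTransformer
from keybert import KeyBERT

# SentenceTransformers moves the model to "cuda" automatically when a CUDA-enabled
# PyTorch build sees a GPU; no .to(device) call on KeyBERT is needed.
st_model = SentenceTransformer("all-MiniLM-L6-v2")
st_model.encode("warm-up")   # the first encode() call runs on the selected device
print(st_model.device)       # expect "cuda:0" on a GPU machine, otherwise "cpu"

kw_model = KeyBERT(model=st_model)
keywords = kw_model.extract_keywords("Supervised learning is the machine learning task of ...")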

thtang commented 2 years ago

Can I assign multiple GPUs for the extraction?

MaartenGr commented 2 years ago

@thtang This depends on the model that you are using: some support it and others do not. By default, sentence-transformers is used, which only uses a single GPU. However, you can create a custom back-end that supports this:

from keybert import KeyBERT
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

        # all available CUDA devices will be used
        self.pool = self.embedding_model.start_multi_process_pool()

    def embed(self, documents, verbose=False):

        # Run encode() on multiple GPUs
        embeddings = self.embedding_model.encode_multi_process(documents, 
                                                               self.pool)
        return embeddings

# Create custom backend and pass it to KeyBERT
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
custom_embedder = CustomEmbedder(embedding_model=model)
kw_model = KeyBERT(model=custom_embedder)
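
A hedged usage sketch continuing from the snippet above (the documents are placeholders; stopping the multi-process pool afterwards is optional):

docs = [
    "First abstract about supervised learning ...",
    "Second abstract about transformer models ...",
]
keywords = kw_model.extract_keywords(docs, keyphrase_ngram_range=(1, 2), top_n=5)

# Clean up the multi-process pool once you are done embedding
model.stop_multi_process_pool(custom_embedder.pool)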

thtang commented 2 years ago

Great it Works! Many thanks~

Amaimersion commented 1 year ago

Is there any method to check or force GPU usage? Something similar to spacy.require_gpu(). I ran a benchmark on a CUDA machine and there was no speedup for KeyBERT, while other models did get faster.

MaartenGr commented 1 year ago

@Amaimersion The GPU should automatically be used when you are using SentenceTransformers in KeyBERT. If it is not being used, then there is likely something wrong with your environment, such as a PyTorch installation that is not CUDA-enabled.
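
As a quick sanity check, something along these lines (plain PyTorch calls) shows whether the environment sees a CUDA-enabled build and a GPU at all:

import torch

print(torch.__version__)          # a version ending in "+cpu" means a CPU-only build was installed
print(torch.cuda.is_available())  # should print True
print(torch.cuda.device_count())  # number of visible GPUs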

Amaimersion commented 1 year ago

Initially I used it like this:

model = KeyBERT("<hf_model_name>") # results in CPU usage

Then I tried this:

model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0" # I also tried without this, still same result
)
hf_model = KeyBERT(model) # results in CPU usage

This prints cpu instead of cuda:0:

model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
print("DEVICE", model.device)
hf_model = KeyBERT(model)

Based on this answer I tried this:

model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
model.encode("test")
print("DEVICE", model.device) # now it prints "cuda:0"
hf_model = KeyBERT(model)

This also prints cuda:0:

model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
model = model.to("cuda:0")
print("DEVICE", model.device)
hf_model = KeyBERT(model)

But there were no performance improvements (I tested it with a benchmark).

Maybe it is because my code doesn't actually use .encode()?

keywords = hf_model.extract_keywords(text, vectorizer=KeyphraseCountVectorizer(), stop_words=None, top_n=20)

Can model training also affect this? I mean, maybe this model should have been trained on a GPU instead of a CPU.

As for the environment, PyTorch recognizes CUDA, because "cuda:0" if torch.cuda.is_available() else "cpu" results in cuda:0.

MaartenGr commented 1 year ago

@Amaimersion It is indeed strange that although your GPU is recognized, it does not seem to be used. Having said that, I would still advise starting from a completely fresh environment, installing KeyBERT, and then installing PyTorch through their installation page here. That would prevent any issues you might have with your environment. Could you try that out?

Also, do you perhaps have multiple GPUs?

Maybe it is because my code doesn't actually use .encode()?

No, KeyBERT actually does use .encode() to encode the documents so that should not be the issue.

Can model training also affect this? I mean, maybe this model should have been trained on a GPU instead of a CPU.

Depends on the model that was trained but since it is a pytorch-based model, it will almost always benefit from using a GPU.

But there were no performance improvements (I tested it with a benchmark).

How did you perform this benchmark? If you are passing a single document at a time, or very short documents, there is a chance that not much GPU power is needed.
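
As an aside, when you do have many documents, passing them as a list in a single call lets the backend batch the encode() step, which is where the GPU pays off. A minimal sketch (the model name and documents are placeholders):

from keybert import KeyBERT

kw_model = KeyBERT("all-MiniLM-L6-v2")
docs = ["first abstract ...", "second abstract ...", "third abstract ..."]
# One call, one batched encoding pass, and a list of keyword lists back.
keywords_per_doc = kw_model.extract_keywords(docs, top_n=10)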

keywords = hf_model.extract_keywords(text, vectorizer=KeyphraseCountVectorizer(), stop_words=None, top_n=20)

The KeyphraseCountVectorizer actually uses Spacy as a back-end, so it might be worthwhile to enable GPU with spacy.require_gpu.
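
For example, something along these lines (a sketch; the document text is a placeholder):

import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Must run before the spaCy pipeline inside the vectorizer is loaded.
spacy.require_gpu()   # raises if no GPU is found; spacy.prefer_gpu() is the softer variant

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    "Some document text ...",
    vectorizer=KeyphraseCountVectorizer(),
    stop_words=None,
    top_n=20,
)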

Amaimersion commented 1 year ago

Environment and code from scratch

Here are the steps you can use to reproduce my environment and code from scratch. You should use Ubuntu 20 with a CUDA device. Alternatively, you can just look through them to verify that everything is installed and configured properly.

Upgrade Ubuntu packages, install Python 3.8 packages, install CUDA toolkit:

sudo apt-get update 
sudo apt-get upgrade -y

sudo apt-get install -y python3-pip python3-venv

sudo apt-get install -y linux-headers-$(uname -r)

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda

echo '' >> ~/.profile
echo 'if [ -d "/usr/local/cuda/bin/" ]; then' >> ~/.profile
echo '    PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' >> ~/.profile
echo 'fi' >> ~/.profile

curl -LO https://github.com/ClementTsang/bottom/releases/download/0.6.8/bottom_0.6.8_amd64.deb
sudo dpkg -i bottom_0.6.8_amd64.deb

sudo reboot

After that verify that Python and CUDA are installed:

$ python3 --version
Python 3.8.10

$ pip3 --version
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

$ nvidia-smi
Tue Jul 19 07:42:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install pip packages:

mkdir test
cd test

python3 -m venv venv
source venv/bin/activate

pip install transformers keybert spacy[cuda117] keyphrase_vectorizers
pip freeze > requirements.txt

cat requirements.txt

requirements.txt content:

blis==0.7.8
catalogue==2.0.7
certifi==2022.6.15
charset-normalizer==2.1.0
click==8.1.3
commonmark==0.9.1
cupy-cuda117==10.6.0
cymem==2.0.6
fastrlock==0.8
filelock==3.7.1
huggingface-hub==0.8.1
idna==3.3
Jinja2==3.1.2
joblib==1.1.0
keybert==0.5.1
keyphrase-vectorizers==0.0.10
langcodes==3.3.0
MarkupSafe==2.1.1
murmurhash==1.0.7
nltk==3.7
numpy==1.23.1
packaging==21.3
pathy==0.6.2
Pillow==9.2.0
preshed==3.0.6
psutil==5.9.1
pydantic==1.9.1
Pygments==2.12.0
pyparsing==3.0.9
PyYAML==6.0
regex==2022.7.9
requests==2.28.1
rich==12.5.1
scikit-learn==1.1.1
scipy==1.8.1
sentence-transformers==2.2.2
sentencepiece==0.1.96
smart-open==5.2.1
spacy==3.4.0
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.3
spacy-transformers==1.1.7
srsly==2.4.3
thinc==8.1.0
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==1.12.0
torchvision==0.13.0
tqdm==4.64.0
transformers==4.20.1
typer==0.4.2
typing-extensions==4.3.0
urllib3==1.26.10
wasabi==0.9.1

Create test script:

touch main.py
nano main.py

Insert this content:

import os
import timeit
from statistics import quantiles
import random

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from sentence_transformers import SentenceTransformer
import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

CACHE_DIR = "cache"
DEVICE = "cpu"
TEXT = [
    "When To Use Generics. The Go 1.18 release adds a major new language feature: support for generic programming. In this article I'm not going to describe what generics are nor how to use them. This article is about when to use generics in Go code, and when not to use them. To be clear, I'll provide general guidelines, not hard and fast rules. Use your own judgement. But if you aren't sure, I recommend using the guidelines discussed here. Let's start with a general guideline for programming Go: write Go programs by writing code, not by defining types. When it comes to generics, if you start writing your program by defining type parameter constraints, you are probably on the wrong path. Start by writing functions. It's easy to add type parameters later when it's clear that they will be useful. One case is when writing functions that operate on the special container types that are defined by the language: slices, maps, and channels. If a function has parameters with those types, and the function code doesn't make any particular assumptions about the element types, then it may be useful to use a type parameter.",

    "Get familiar with workspaces. Workspaces in Go 1.18 let you work on multiple modules simultaneously without having to edit go.mod files for each module. Each module within a workspace is treated as a root module when resolving dependencies. Previously, to add a feature to one module and use it in another module, you needed to either publish the changes to the first module, or edit the go.mod file of the dependent module with a replace directive for your local, unpublished module changes. In order to publish without errors, you had to remove the replace directive from the dependent module's go.mod file after you published the local changes to the first module. With Go workspaces, you control all your dependencies using a go.work file in the root of your workspace directory. The go.work file has use and replace directives that override the individual go.mod files, so there is no need to edit each go.mod file individually. You create a workspace by running go work init with a list of module directories as space-separated arguments. The workspace doesn't need to contain the modules you're working with. The init command creates a go.work file that lists modules in the workspace. If you run go work init without arguments, the command creates an empty workspace.",

    "How Go Mitigates Supply Chain Attacks. Modern software engineering is collaborative, and based on reusing Open Source software. That exposes targets to supply chain attacks, where software projects are attacked by compromising their dependencies. Despite any process or technical measure, every dependency is unavoidably a trust relationship. However, the Go tooling and design help mitigate risk at various stages. There is no way for changes in the outside world—such as a new version of a dependency being published—to automatically affect a Go build. Unlike most other package managers files, Go modules don't have a separate list of constraints and a lock file pinning specific versions. The version of every dependency contributing to any Go build is fully determined by the go.mod file of the main module. Since Go 1.16, this determinism is enforced by default, and build commands (go build, go test, go install, go run, …) will fail if the go.mod is incomplete. The only commands that will change the go.mod (and therefore the build) are go get and go mod tidy. These commands are not expected to be run automatically or in CI, so changes to dependency trees must be made deliberately and have the opportunity to go through code review. This is very important for security, because when a CI system or new machine runs go build, the checked-in source is the ultimate and complete source of truth for what will get built. There is no way for third parties to affect that."
]

TOKENIZER_1 = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)
MODEL_1 = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)
MODEL_1 = MODEL_1.to(DEVICE)

MODEL_2 = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=CACHE_DIR, device=DEVICE)
MODEL_2 = MODEL_2.to(DEVICE)
KEYBERT_MODEL = KeyBERT(MODEL_2)

def main():
    os.makedirs(CACHE_DIR, exist_ok=True)

    spacy.require_gpu(0)

    print(f"device - {DEVICE}")

    result = benchmark(call_1)
    print(f"result of call_1(): {result}")

    result = benchmark(call_2)
    print(f"result of call_2(): {result}")

def call_1():
    text = random.choice(TEXT)
    batch = TOKENIZER_1(
        [text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt"
    )
    batch = batch.to(DEVICE)
    translated = MODEL_1.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=10,
        temperature=1.5
    )
    tgt_text = TOKENIZER_1.batch_decode(translated, skip_special_tokens=True)

    return tgt_text

def call_2():
    text = random.choice(TEXT)
    keywords = KEYBERT_MODEL.extract_keywords(text, stop_words=None, top_n=10, vectorizer=KeyphraseCountVectorizer())

    return keywords

def benchmark(f):
    times = timeit.Timer(f).repeat(repeat=30, number=1)

    minS = round(min(times), 3)
    maxS = round(max(times), 3)
    avgS = round(sum(times) / len(times), 3)

    percentiles = quantiles(times, n=100, method="inclusive")
    p50 = round(percentiles[49], 3)
    p75 = round(percentiles[74], 3)
    p90 = round(percentiles[89], 3)

    result = (f"min = {minS}s, max = {maxS}s, avg = {avgS}s, p50 = {p50}s, p75 = {p75}s, p90 = {p90}s")

    return result

if __name__ == "__main__":
    main()

Benchmarks

As you can see from the code, it calls every function 30 times and prints aggregated metrics. It randomly selects a text in order to avoid caching effects (with a single text, every call after the first one would be very fast). call_1() is just an example to demonstrate another model. call_2() is the KeyBERT code that I'm using. First I will test call_1(), then call_2(), not both together. I will use nvidia-smi to see the CUDA utilization graph and bottom to see the CPU utilization graph. At the beginning of every benchmark I show the content of main.py, then run the benchmark, then show the device utilization graph, and finally show the results.

call_1() on CPU

Screencast from 19-07-22 12:34:07.webm

p90 = 6s

As you can see, CPU usage is at a constant 50% (because PyTorch uses half of the CPU). GPU usage is at 0%. The results are too slow.

Note that from here on I will not take the model loading phase into account. You can see it in the memory graph. I only look at the graphs after the models were loaded, which is indicated by a constant memory graph.

call_1() on GPU

p90 = 0.44s

Screencast from 19-07-22 12:37:52.webm

GPU usage is at 80%. Results are 13x faster than on CPU. So this method clearly uses CUDA.

call_2() on CPU

p90 = 0.78s

Screencast from 19-07-22 12:39:42.webm

CPU usage at 46%.

call_2() on GPU

p90 = 0.72s

Screencast from 19-07-22 12:47:11.webm

CPU usage is at 40%. GPU usage is at 0-13%, but mostly at 0%. The result is 1.1x faster. Actually, it may not even be faster, just some deviation.

call_2() on GPU without vectorizer

p90 = 0.06s

Screencast from 19-07-22 12:49:15.webm

I wanted to try pure KeyBERT, so I removed the custom vectorizer. The result is 12x faster than the result with the vectorizer.

call_2() on CPU without vectorizer

p90 = 0.26s

Screencast from 19-07-22 12:50:30.webm

It was slower by 4x.

Result

Based on what I saw, is the vectorizer itself too slow? Or does the vectorizer not use CUDA? Using KeyBERT with KeyphraseCountVectorizer yields the same results on CPU and GPU. But using KeyBERT without KeyphraseCountVectorizer yields different results: it was much faster on GPU.

The average length of the test texts is 1200 characters. I also tried texts of 5k and 10k characters. Same results.

I'm not sure, but it looks like KeyphraseCountVectorizer uses the CPU even when the GPU is forced, while KeyBERT itself uses the GPU.

MaartenGr commented 1 year ago

@Amaimersion Let me start off by saying thank you for this extensive search into what exactly is happening here! You are one of the few that goes that much in-depth and it makes my work a whole lot easier 😄

There are a few small things that I have noticed but I believe most of it is indeed due to the KeyphraseCountVectorizer which I will come back to in a bit.

pip install transformers keybert spacy[cuda117] keyphrase_vectorizers

After performing the above, it might be worthwhile to check again whether CUDA is enabled. From your results, I am quite sure it is, but just to be certain.

TOKENIZER_1 = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)

Thank you for this example, it indeed clearly indicates that GPU is working as it should in pytorch.

call_2() on GPU without vectorizer call_2() on CPU without vectorizer

Based on these, I think you are correct in stating that it is likely the KeyphraseCountVectorizer. In my experiments, that model can be quite slow compared to, for example, a SentenceTransformer model. The processing it needs to do seems to require much more compute, so it is unsurprising that it slows things down quite a bit. Having said that, you should still see some improvement when using a CUDA-enabled GPU, which you clearly have.

I believe what is happening is a mixture of two things:

The lengths of the documents make it a bit misleading
This might sound a bit strange seeing as you got the same results regardless of the length of the texts. The misleading part here is that SentenceTransformers simply truncates the text if it passes a certain length, but this same process does not happen with the KeyphraseCountVectorizer. Thus, the GPU will only be used for a short time on the truncated text, since embedding a single text is relatively quick. This leads me to the following:

KeyphraseCountVectorizer uses a CPU-optimized model
The default model in KeyphraseCountVectorizer is spaCy's en_core_web_sm, which is optimized for the CPU and not the GPU. What likely happens is that after embedding the documents using the SentenceTransformer, which typically happens quite fast, the KeyphraseCountVectorizer takes some time to generate the candidate keywords.

I think the solution here is to either stop using KeyphraseCountVectorizer or, and I would highly advise testing this out, use the en_core_web_trf model instead. That model is, like a SentenceTransformer, a transformer model and thereby benefits from a GPU. This does not mean it will automatically be faster than en_core_web_sm, since they differ in size and speed.
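
If you want to try that, a hedged sketch (this assumes KeyphraseCountVectorizer exposes a spacy_pipeline argument for choosing the spaCy model, and that en_core_web_trf has already been downloaded with python -m spacy download en_core_web_trf):

import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

spacy.require_gpu()

# Swap the CPU-optimized default (en_core_web_sm) for the transformer pipeline.
vectorizer = KeyphraseCountVectorizer(spacy_pipeline="en_core_web_trf")
kw_model = KeyBERT()
keywords = kw_model.extract_keywords("Some document text ...", vectorizer=vectorizer, top_n=20)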

Amaimersion commented 1 year ago

call_2() on GPU with en_core_web_trf

p90 = 4s

call_2() on GPU with en_core_web_lg

p90 = 2.1s

call_2() on GPU with en_core_web_md

p90 = 1.7s

call_2() on GPU with en_core_web_sm

p90 = 0.72s

Well, it looks like sm is the fastest variant in this case.

Summing up: KeyBERT does use the GPU, and the reason for the performance issue is something else. In my case it was the external vectorizer.

At the moment it looks like I have the fastest variant, so I will not dig deeper here. Perhaps in the future I will stop using this model, or KeyphraseCountVectorizer altogether. Anyway, thank you @MaartenGr for the detailed explanations!

MaartenGr commented 1 year ago

Thank you for the extensive description! If you ever run into any other issues or if you have any concerns, please let me know. The more we know about potential bottlenecks and/or improvements the better.

hengee commented 1 year ago

Faced a similar problem here. With CUDA enabled, I noticed that GPU usage spikes once in a while and then drops back to 0, while the CPU is under high load the whole time. Very odd.

I followed the Quickstart guide.

from keybert import KeyBERT
kw_model = KeyBERT()

for i in tqdm(range(len(processed_docs))): # processed_docs is a list of string lists
    keywords = kw_model.extract_keywords(processed_docs[i], keyphrase_ngram_range=(1, 3), stop_words='english', use_maxsum=True, nr_candidates=20, top_n=5)

Estimated 80+ hours to finish a total of 5000 lists of strings, over a million sentences in total. Later, I changed to the following code with a specific model specified. The same thing happened as well.

from keybert import KeyBERT
kw_model = KeyBERT(model="all-mpnet-base-v2")
for i in tqdm(range(len(processed_docs))):
    keywords = kw_model.extract_keywords(processed_docs[i], keyphrase_ngram_range=(1, 3), stop_words='english', use_maxsum=True, nr_candidates=20, top_n=5)

Note: I just downloaded the latest library today to try out.

MaartenGr commented 1 year ago

@hengee I believe in your specific case two things might be happening. First, by setting keyphrase_ngram_range to (1, 3), quite a number of candidate words will be created, which can take some time to process. It should, however, still keep the GPU busy fairly often. Second, and I believe this is the main issue here, is enabling use_maxsum. Although it works quite well, it does take some time to process and does not make use of the GPU. So that is likely what you are seeing. Using MMR instead is quite a bit faster.
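
For reference, a hedged sketch of the MMR variant (the parameter values are illustrative):

from keybert import KeyBERT

doc = "Some document text ..."  # placeholder
kw_model = KeyBERT(model="all-mpnet-base-v2")
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,    # Maximal Marginal Relevance instead of Max Sum Distance
    diversity=0.5,   # trade-off between relevance and diversity of the keywords
    top_n=5,
)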