TejasAdsul opened this issue 2 years ago
KeyBERT uses a model from SentenceTransformers as its default. That package automatically selects a GPU if it can find one, so there is no need to select one yourself! Just use KeyBERT as is and you will be using the GPU if you have one installed.
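For example, a minimal sketch (the document text is just a placeholder):

```python
from keybert import KeyBERT

# The default SentenceTransformer backend picks up a CUDA device on its own
# when a GPU-enabled PyTorch build is installed.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords("some document text")  # placeholder document
```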
Can I assign multiple GPUs for the extraction?
@thtang This depends on the model that you are using; some support it and others do not. By default, sentence-transformers is used, which will only use a single GPU. However, you can create a custom backend that supports this:
```python
from keybert import KeyBERT
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model
        # All available CUDA devices will be used
        self.pool = self.embedding_model.start_multi_process_pool()

    def embed(self, documents, verbose=False):
        # Run encode() on multiple GPUs
        embeddings = self.embedding_model.encode_multi_process(documents, self.pool)
        return embeddings

# Create custom backend and pass it to KeyBERT
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
custom_embedder = CustomEmbedder(embedding_model=model)
kw_model = KeyBERT(model=custom_embedder)
```
Great, it works! Many thanks~
Is there any method to check/force GPU usage, similar to spacy.require_gpu()? I performed a benchmark on a CUDA machine and there was no speed-up for KeyBERT, while other models were improved.
@Amaimersion The GPU should automatically be used when you are using SentenceTransformers in KeyBERT. If it is not being used, then there is likely something wrong with the environment in which you are working that does not recognize a CUDA-enabled pytorch version.
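As a quick sanity check, something like this minimal sketch (the all-MiniLM-L6-v2 model here is only an example) should show which device the embedding model ends up on before it is handed to KeyBERT:

```python
import torch
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT

print("CUDA available:", torch.cuda.is_available())

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
model.encode("warm-up")  # in some versions the model is only moved to the GPU on the first encode()
print("Model device:", model.device)

kw_model = KeyBERT(model=model)
```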
Initially I used it like this:
```python
model = KeyBERT("<hf_model_name>")  # results in CPU usage
```
Then I tried this:
```python
model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"  # I also tried without this, still same result
)
hf_model = KeyBERT(model)  # results in CPU usage
```
This prints cpu instead of cuda:0:
```python
model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
print("DEVICE", model.device)
hf_model = KeyBERT(model)
```
Based on this answer I tried this:
```python
model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
model.encode("test")
print("DEVICE", model.device)  # now it prints "cuda:0"
hf_model = KeyBERT(model)
```
This also prints cuda:0:
```python
model = SentenceTransformer(
    "<hf_model_name>",
    device="cuda:0"
)
model = model.to("cuda:0")
print("DEVICE", model.device)
hf_model = KeyBERT(model)
```
But there were no performance improvements (I tested it with a benchmark). Maybe it is because my code doesn't actually use this .encode()?
```python
keywords = hf_model.extract_keywords(text, vectorizer=KeyphraseCountVectorizer(), stop_words=None, top_n=20)
```
Can model training also affect this? I mean, maybe this model should have been trained on a GPU device instead of a CPU device.
As for the environment, pytorch recognizes CUDA, because `"cuda:0" if torch.cuda.is_available() else "cpu"` results in cuda:0.
@Amaimersion It is indeed strange that although your GPU is recognized, it does not seem to be used. Having said that, I would still advise starting from a completely fresh environment, installing KeyBERT, and then installing PyTorch through their installation page here. That would prevent any issues you might have with your environment. Could you try that out?
Also, do you perhaps have multiple GPUs?
> Maybe it is because my code doesn't actually use this .encode()?
No, KeyBERT actually does use .encode() to encode the documents, so that should not be the issue.
> Can model training also affect this? I mean, maybe this model should have been trained on a GPU device instead of a CPU device.
That depends on the model that was trained, but since it is a pytorch-based model, it will almost always benefit from using a GPU.
> But there were no performance improvements (I tested it with a benchmark).
How did you perform this benchmark? If you are passing a single document at a time, or very short documents, there is a chance that not much GPU power is necessary.
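As a side note, here is a rough sketch of what I mean (the documents and model name are placeholders): passing a list of documents to extract_keywords lets the underlying encode() batch them, which is where a GPU usually shows a clear speed-up.

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

docs = ["first document ...", "second document ...", "third document ..."]  # placeholder texts

# One call over the whole list: the documents are embedded in batches,
# instead of one tiny encode() call per document.
keywords_per_doc = kw_model.extract_keywords(docs, top_n=10)
```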
> keywords = hf_model.extract_keywords(text, vectorizer=KeyphraseCountVectorizer(), stop_words=None, top_n=20)
The KeyphraseCountVectorizer actually uses spaCy as a back-end, so it might be worthwhile to enable the GPU with spacy.require_gpu.
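Something along these lines might be worth trying, as an untested sketch (it assumes a GPU-enabled spaCy install such as spacy[cuda117]):

```python
import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Ask spaCy for the GPU before the vectorizer builds its pipeline.
# require_gpu() raises if no GPU is found; prefer_gpu() would fall back to the CPU.
spacy.require_gpu()

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    "some document text",  # placeholder
    vectorizer=KeyphraseCountVectorizer(),
    stop_words=None,
    top_n=20,
)
```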
Here are the steps that you can use to reproduce my environment and code from scratch. You should use Ubuntu 20 with a CUDA device. Actually, you can just look at them to verify that everything is installed and configured properly.
Upgrade Ubuntu packages, install Python 3.8 packages, install CUDA toolkit:
```bash
sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get install -y python3-pip python3-venv
sudo apt-get install -y linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
echo '' >> ~/.profile
echo 'if [ -d "/usr/local/cuda/bin/" ]; then' >> ~/.profile
echo '    PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' >> ~/.profile
echo 'fi' >> ~/.profile
curl -LO https://github.com/ClementTsang/bottom/releases/download/0.6.8/bottom_0.6.8_amd64.deb
sudo dpkg -i bottom_0.6.8_amd64.deb
sudo reboot
```
After that, verify that Python and CUDA are installed:
```
$ python3 --version
Python 3.8.10

$ pip3 --version
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

$ nvidia-smi
Tue Jul 19 07:42:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Install pip packages:
```bash
mkdir test
cd test
python3 -m venv venv
source venv/bin/activate
pip install transformers keybert spacy[cuda117] keyphrase_vectorizers
pip freeze > requirements.txt
cat requirements.txt
```
requirements.txt content:
```
blis==0.7.8
catalogue==2.0.7
certifi==2022.6.15
charset-normalizer==2.1.0
click==8.1.3
commonmark==0.9.1
cupy-cuda117==10.6.0
cymem==2.0.6
fastrlock==0.8
filelock==3.7.1
huggingface-hub==0.8.1
idna==3.3
Jinja2==3.1.2
joblib==1.1.0
keybert==0.5.1
keyphrase-vectorizers==0.0.10
langcodes==3.3.0
MarkupSafe==2.1.1
murmurhash==1.0.7
nltk==3.7
numpy==1.23.1
packaging==21.3
pathy==0.6.2
Pillow==9.2.0
preshed==3.0.6
psutil==5.9.1
pydantic==1.9.1
Pygments==2.12.0
pyparsing==3.0.9
PyYAML==6.0
regex==2022.7.9
requests==2.28.1
rich==12.5.1
scikit-learn==1.1.1
scipy==1.8.1
sentence-transformers==2.2.2
sentencepiece==0.1.96
smart-open==5.2.1
spacy==3.4.0
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.3
spacy-transformers==1.1.7
srsly==2.4.3
thinc==8.1.0
threadpoolctl==3.1.0
tokenizers==0.12.1
torch==1.12.0
torchvision==0.13.0
tqdm==4.64.0
transformers==4.20.1
typer==0.4.2
typing-extensions==4.3.0
urllib3==1.26.10
wasabi==0.9.1
```
Create test script:
```bash
touch main.py
nano main.py
```
Insert this content:
```python
import os
import timeit
from statistics import quantiles
import random

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from sentence_transformers import SentenceTransformer
import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

CACHE_DIR = "cache"
DEVICE = "cpu"
TEXT = [
    "When To Use Generics. The Go 1.18 release adds a major new language feature: support for generic programming. In this article I'm not going to describe what generics are nor how to use them. This article is about when to use generics in Go code, and when not to use them. To be clear, I'll provide general guidelines, not hard and fast rules. Use your own judgement. But if you aren't sure, I recommend using the guidelines discussed here. Let's start with a general guideline for programming Go: write Go programs by writing code, not by defining types. When it comes to generics, if you start writing your program by defining type parameter constraints, you are probably on the wrong path. Start by writing functions. It's easy to add type parameters later when it's clear that they will be useful. One case is when writing functions that operate on the special container types that are defined by the language: slices, maps, and channels. If a function has parameters with those types, and the function code doesn't make any particular assumptions about the element types, then it may be useful to use a type parameter.",
    "Get familiar with workspaces. Workspaces in Go 1.18 let you work on multiple modules simultaneously without having to edit go.mod files for each module. Each module within a workspace is treated as a root module when resolving dependencies. Previously, to add a feature to one module and use it in another module, you needed to either publish the changes to the first module, or edit the go.mod file of the dependent module with a replace directive for your local, unpublished module changes. In order to publish without errors, you had to remove the replace directive from the dependent module's go.mod file after you published the local changes to the first module. With Go workspaces, you control all your dependencies using a go.work file in the root of your workspace directory. The go.work file has use and replace directives that override the individual go.mod files, so there is no need to edit each go.mod file individually. You create a workspace by running go work init with a list of module directories as space-separated arguments. The workspace doesn't need to contain the modules you're working with. The init command creates a go.work file that lists modules in the workspace. If you run go work init without arguments, the command creates an empty workspace.",
    "How Go Mitigates Supply Chain Attacks. Modern software engineering is collaborative, and based on reusing Open Source software. That exposes targets to supply chain attacks, where software projects are attacked by compromising their dependencies. Despite any process or technical measure, every dependency is unavoidably a trust relationship. However, the Go tooling and design help mitigate risk at various stages. There is no way for changes in the outside world—such as a new version of a dependency being published—to automatically affect a Go build. Unlike most other package managers files, Go modules don't have a separate list of constraints and a lock file pinning specific versions. The version of every dependency contributing to any Go build is fully determined by the go.mod file of the main module. Since Go 1.16, this determinism is enforced by default, and build commands (go build, go test, go install, go run, …) will fail if the go.mod is incomplete. The only commands that will change the go.mod (and therefore the build) are go get and go mod tidy. These commands are not expected to be run automatically or in CI, so changes to dependency trees must be made deliberately and have the opportunity to go through code review. This is very important for security, because when a CI system or new machine runs go build, the checked-in source is the ultimate and complete source of truth for what will get built. There is no way for third parties to affect that."
]
TOKENIZER_1 = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)
MODEL_1 = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)
MODEL_1 = MODEL_1.to(DEVICE)
MODEL_2 = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=CACHE_DIR, device=DEVICE)
MODEL_2 = MODEL_2.to(DEVICE)
KEYBERT_MODEL = KeyBERT(MODEL_2)


def main():
    os.makedirs(CACHE_DIR, exist_ok=True)
    spacy.require_gpu(0)
    print(f"device - {DEVICE}")

    result = benchmark(call_1)
    print(f"result of call_1(): {result}")

    result = benchmark(call_2)
    print(f"result of call_2(): {result}")


def call_1():
    text = random.choice(TEXT)
    batch = TOKENIZER_1(
        [text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt"
    )
    batch = batch.to(DEVICE)
    translated = MODEL_1.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=10,
        temperature=1.5
    )
    tgt_text = TOKENIZER_1.batch_decode(translated, skip_special_tokens=True)
    return tgt_text


def call_2():
    text = random.choice(TEXT)
    keywords = KEYBERT_MODEL.extract_keywords(text, stop_words=None, top_n=10, vectorizer=KeyphraseCountVectorizer())
    return keywords


def benchmark(f):
    times = timeit.Timer(f).repeat(repeat=30, number=1)
    minS = round(min(times), 3)
    maxS = round(max(times), 3)
    avgS = round(sum(times) / len(times), 3)
    percentiles = quantiles(times, n=100, method="inclusive")
    p50 = round(percentiles[49], 3)
    p75 = round(percentiles[74], 3)
    p90 = round(percentiles[89], 3)
    result = (f"min = {minS}s, max = {maxS}s, avg = {avgS}s, p50 = {p50}s, p75 = {p75}s, p90 = {p90}s")
    return result


if __name__ == "__main__":
    main()
```
As you can see from the code, it will call every function 30 times and print aggregated metrics. It will randomly select a text in order to avoid caching effects (I mean, with a single text, every subsequent call after the first one would be very fast). call_1() is just an example to demonstrate another model. call_2() is the KeyBERT code that I'm using. First I will test call_1(), then call_2(), not all together. I will use nvidia-smi to see the CUDA utilization graph, and bottom to see the CPU utilization graph. At the beginning of every benchmark I will show the content of main.py, then run the benchmark, then show the device utilization graph, then show the results.
Screencast from 19-07-22 12:34:07.webm
p90 = 6s
As you can see, CPU usage is at a constant 50% (because pytorch uses half of the CPU). GPU usage is at 0%. The results are too slow.
Note that from now on I will not take into account the moment when the models are loaded. You can see it in the memory graph. I will only look at the graph after the models have been loaded, which is indicated by a constant memory graph.
p90 = 0.44s
Screencast from 19-07-22 12:37:52.webm
GPU usage is at 80%. Results are 13x faster than on CPU. So this method clearly uses CUDA.
p90 = 0.78s
Screencast from 19-07-22 12:39:42.webm
CPU usage at 46%.
p90 = 0.72s
Screencast from 19-07-22 12:47:11.webm
CPU usage at 40%. GPU usage at 0-13%, but mostly at 0%. The result is faster by 1.1x. Actually, it may not even be faster, just some deviation.
p90 = 0.06s
Screencast from 19-07-22 12:49:15.webm
I wanted to try with pure KeyBERT, so I removed the custom vectorizer. The result is 12x faster than the result with the vectorizer.
p90 = 0.26s
Screencast from 19-07-22 12:50:30.webm
It was slower by 4x.
Based on what I saw, is it the vectorizer itself that is too slow? Or does the vectorizer not use CUDA? Using KeyBERT with KeyphraseCountVectorizer yields the same results on CPU and GPU. But using KeyBERT without KeyphraseCountVectorizer yields different results; it was much faster on the GPU.
The average length of the test texts is 1200 characters. I also tried 5k and 10k character texts. Same results.
I'm not sure, but it looks like KeyphraseCountVectorizer uses the CPU even when the GPU is forced, while KeyBERT itself uses the GPU.
@Amaimersion Let me start off by saying thank you for this extensive search into what exactly is happening here! You are one of the few that goes that much in-depth and it makes my work a whole lot easier 😄
There are a few small things that I have noticed, but I believe most of it is indeed due to the KeyphraseCountVectorizer, which I will come back to in a bit.
> pip install transformers keybert spacy[cuda117] keyphrase_vectorizers
After performing the above, it might be worthwhile to check again whether CUDA is enabled. From your results, I am quite sure it is, but just to be certain.
> TOKENIZER_1 = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase", cache_dir=CACHE_DIR)
Thank you for this example, it indeed clearly indicates that the GPU is working as it should in pytorch.
> call_2() on GPU without vectorizer
> call_2() on CPU without vectorizer
Based on these, I think you are correct in stating that it is likely the KeyphraseCountVectorizer. In my experiments, that model can be quite slow compared to a SentenceTransformer model, for example. The processing it needs to do seems to require much more compute, so it is unsurprising that it slows things down quite a bit. Having said that, you should still see some improvement when using a cuda-enabled GPU, which you clearly have.
I believe what is happening is a mixture of two things:

1. KeyphraseCountVectorizer, as a default, actually uses a model optimized for the CPU, namely en_core_web_sm
2. The lengths of the documents make it a bit misleading

The second point might sound a bit strange seeing as you got the same results regardless of the length of the texts. The misleading part here is that SentenceTransformers simply truncates the text if it passes a certain length, but this same process does not happen with the KeyphraseCountVectorizer. Thus, the GPU will only be used for a short time on the truncated text, since embedding a single text is relatively quick. This leads me to the following:
KeyphraseCountVectorizer uses a CPU-optimized model. The default model in KeyphraseCountVectorizer is spaCy's en_core_web_sm, which is optimized for the CPU and not the GPU. What likely happens is that after embedding the documents with the SentenceTransformer, which typically happens quite fast, the KeyphraseCountVectorizer will take some time to generate the candidate keywords.
I think the solution here is to either stop using KeyphraseCountVectorizer or, which I would highly advise testing out, use the en_core_web_trf model instead. That model is, like a SentenceTransformer, a transformer model and thereby benefits from using a GPU. This does not mean it will automatically be faster than en_core_web_sm, since they differ in size and speed.
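If I remember correctly, the spaCy pipeline can be swapped through the vectorizer's spacy_pipeline argument; a rough, untested sketch (it assumes en_core_web_trf has been downloaded and a GPU-enabled spaCy install):

```python
import spacy
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

spacy.require_gpu()  # let the transformer pipeline run on the GPU

# Swap the default en_core_web_sm for the transformer-based pipeline.
vectorizer = KeyphraseCountVectorizer(spacy_pipeline="en_core_web_trf")

kw_model = KeyBERT()
keywords = kw_model.extract_keywords("some document text", vectorizer=vectorizer, top_n=20)
```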
p90 = 4s
p90 = 2.1s
p90 = 1.7s
p90 = 0.72s
Well, it looks like sm is the fastest variant in this case.
Summing up: KeyBERT uses the GPU, and the reason for the performance issue is something else. In my case it was the external vectorizer.
At the moment it looks like I have the fastest variant, so I will not dig deeper here. Perhaps in the future I will stop using this model, or just KeyphraseCountVectorizer. Anyway, thank you @MaartenGr for the detailed explanations!
Thank you for the extensive description! If you ever run into any other issues or if you have any concerns, please let me know. The more we know about potential bottlenecks and/or improvements the better.
Faced a similar problem here. With CUDA enabled, I noticed that GPU usage spikes once in a while and goes back to 0, while the CPU is under high load all the time. Very odd.
I followed the Quickstart guide.
```python
from keybert import KeyBERT
from tqdm import tqdm  # needed for the progress bar

kw_model = KeyBERT()

for i in tqdm(range(len(processed_docs))):  # processed_docs is a list of string lists
    keywords = kw_model.extract_keywords(processed_docs[i], keyphrase_ngram_range=(1, 3), stop_words='english', use_maxsum=True, nr_candidates=20, top_n=5)
```
Estimated 80+ hours to finish a total of 5000 lists of strings, a little over a million sentences in total. Later, I changed to the following code with a specific model specified. The same thing happened as well.
```python
from keybert import KeyBERT
from tqdm import tqdm

kw_model = KeyBERT(model="all-mpnet-base-v2")

for i in tqdm(range(len(processed_docs))):
    keywords = kw_model.extract_keywords(processed_docs[i], keyphrase_ngram_range=(1, 3), stop_words='english', use_maxsum=True, nr_candidates=20, top_n=5)
```
Note: I just downloaded the latest library today to try out.
@hengee I believe in your specific case two things might be happening. First, by setting keyphrase_ngram_range to (1, 3), quite a number of candidate words will be created, which can take some time to process. It should, however, still keep the GPU working often. Second, and I believe this is the main issue here, is enabling use_maxsum. Although it works quite well, it does take some time to process and does not make use of the GPU. So it is likely that that is what you are seeing. Instead, using MMR is quite a bit faster.
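For reference, a rough sketch of the MMR variant (the model name, document, and parameter values are only examples):

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-mpnet-base-v2")
doc = "some document text"  # placeholder

keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,    # MMR instead of Max Sum Distance
    diversity=0.5,   # 0 = very similar keywords, 1 = very diverse keywords
    top_n=5,
)
```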
I'm trying to extract keywords and keyphrases from around 20k abstracts of journal articles. The FAQ mentions that it is recommended to use a GPU with KeyBERT. However, I'm unclear how exactly to run the extract_keywords function on the GPU. I tried
```python
model = KeyBERT()
model.to(device)
```
but it says that KeyBERT() has no attribute 'to'. I'd appreciate some help in implementing KeyBERT on the GPU. Thanks!