JohnSnowLabs / nlu

1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0
853 stars 130 forks

using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word #103

Open krico1 opened 2 years ago

krico1 commented 2 years ago

Hi, so we are working on generating biobert embeddings for our project. When we run it on a single word it takes about a second or so. When we run on a list of 10,000 words, it either times out or takes upwards of hours to run. Is this normal? Below is how we are using it:

```python
def load_biobert(self):
    """Load BioBERT model (for sentence-type embeddings)."""
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2f seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]
```

C-K-Loan commented 2 years ago

Hi @krico1, large embedding models like BioBERT can be quite slow because of the deep networks behind them. But you can achieve a ~10x speedup by running NLU in GPU mode.

All you need to do is set `gpu=True` and make sure the GPU is visible to TensorFlow beforehand. Then you get the GPU pipe with `nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)`.


See this notebook as a reference.

Also note: if you have a large dataset at hand, it will be faster to feed NLU all the data at once instead of one string at a time.
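The batching advice above can be sketched as follows. The word list and variable names are illustrative, and the commented-out lines assume a `pipe` loaded as shown earlier in the thread:

```python
import pandas as pd

# Illustrative stand-in for the issue's 10,000-word list.
strings = ["aspirin", "ibuprofen", "acetaminophen"]

# Put every text into one DataFrame so the whole batch goes through Spark in a
# single predict() call, instead of paying per-call overhead once per word.
data = pd.DataFrame({"text": strings})

# Assumption: pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
# embeddings = pipe.predict(data, output_level='sentence', get_embeddings=True)
# embedding_list = embeddings.sentence_embedding_biobert.tolist()

print(data.shape)  # (3, 1): one row per input text
```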

MargheCap commented 2 years ago

@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings (even when adding `get_embeddings=True`). I tried multiple models and included other parameters, but with no success. In particular, `nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True)` does not give the expected output. I thought the column was being dropped, so I added `drop_irrelevant_cols=False`, but still no success.

Thank you!

raven44099 commented 1 year ago

@C-K-Loan I have the same problem as @MargheCap . I assume it has something to do with how we install the nlu package. Could you share how you install it?

With my installation (below), I still get a rather slow calculation (timing screenshot omitted).

And I checked the GPU visibility to Tensorflow:

```python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
# -> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```

For installation, I used:

```shell
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
```

```python
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
```

I used this installation because it was proposed in this Colab notebook: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G

Furthermore, the _quick_start_googlecolab.ipynb referenced here (https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook) uses `from sparknlp.pretrained import PretrainedPipeline`, but I don't know how to load the model that way. Using `pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True)` gives an error: `...unexpected keyword argument 'gpu'`.
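One possible explanation for that TypeError (a hedged sketch, not verified against this exact version): in raw Spark NLP, as opposed to NLU, the GPU flag belongs to session startup rather than to `PretrainedPipeline`, so it would be passed via `sparknlp.start(gpu=True)` instead. The pipeline name below is a generic example, not the BioBERT model from this thread:

```python
# Sketch only: requires a Spark NLP installation; not runnable standalone.
# import sparknlp
# from sparknlp.pretrained import PretrainedPipeline
#
# spark = sparknlp.start(gpu=True)   # GPU is requested at session startup
# pipe = PretrainedPipeline('explain_document_ml', lang='en')  # example pipeline name
# result = pipe.annotate("random sentence")
```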