JohnSnowLabs / nlu

1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0
853 stars 130 forks

using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word #103

Open krico1 opened 2 years ago

krico1 commented 2 years ago

Hi, so we are working on generating biobert embeddings for our project. When we run it on a single word it takes about a second or so. When we run on a list of 10,000 words, it either times out or takes upwards of hours to run. Is this normal? Below is how we are using it:

```python
def load_biobert(self):
    """Load BioBERT model (for sentence-type embeddings)."""
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2f seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]
```

C-K-Loan commented 2 years ago

Hi @krico1, large embedding models like BioBERT can be quite slow because of the deep networks behind them. But you can achieve a ~10x speedup by running NLU in GPU mode.

All you need to do is set `gpu=True` and make sure the GPU is visible to TensorFlow beforehand. Then you get the GPU pipe with `nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)`.


See this notebook as a reference.

Also note: if you have a large dataset at hand, it will be faster to feed NLU all the data at once instead of one string at a time.
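The batching advice above can be sketched as follows. The word list and variable names are illustrative, and the commented-out lines assume a `pipe` loaded as shown earlier in the thread:

```python
import pandas as pd

# Illustrative stand-in for the issue's 10,000-word list.
strings = ["aspirin", "ibuprofen", "acetaminophen"]

# Put every text into one DataFrame so the whole batch goes through Spark in a
# single predict() call, instead of paying per-call overhead once per word.
data = pd.DataFrame({"text": strings})

# Assumption: pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
# embeddings = pipe.predict(data, output_level='sentence', get_embeddings=True)
# embedding_list = embeddings.sentence_embedding_biobert.tolist()

print(data.shape)  # (3, 1): one row per input text
```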

MargheCap commented 2 years ago

@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings (even when adding `get_embeddings=True`). I tried multiple models and included other parameters, but with no success. In particular, `nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True)` does not give the expected output. I thought the column was being dropped, so I added `drop_irrelevant_cols=False`, but still no success.

Thank you!

raven44099 commented 1 year ago

@C-K-Loan I have the same problem as @MargheCap . I assume it has something to do with how we install the nlu package. Could you share how you install it?

With my installation (below), I still get a rather slow calculation (timing screenshot omitted).

And I checked the GPU visibility to Tensorflow:

```python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
# -> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```

For installation, I used:

```shell
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
```

```python
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
```

I used this installation because it was proposed in this Colab notebook: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G

Furthermore, the _quick_start_googlecolab.ipynb referenced here (https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook) uses `from sparknlp.pretrained import PretrainedPipeline`, but I don't know how to load the model that way. Using `pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True)` gives an error: `...unexpected keyword argument 'gpu'`.
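One possible explanation for that TypeError (a hedged sketch, not verified against this exact version): in raw Spark NLP, as opposed to NLU, the GPU flag belongs to session startup rather than to `PretrainedPipeline`, so it would be passed via `sparknlp.start(gpu=True)` instead. The pipeline name below is a generic example, not the BioBERT model from this thread:

```python
# Sketch only: requires a Spark NLP installation; not runnable standalone.
# import sparknlp
# from sparknlp.pretrained import PretrainedPipeline
#
# spark = sparknlp.start(gpu=True)   # GPU is requested at session startup
# pipe = PretrainedPipeline('explain_document_ml', lang='en')  # example pipeline name
# result = pipe.annotate("random sentence")
```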