ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Extraction of Word Embeddings for textual bioassays data using Biomed_Roberta_Base #578

Open Femme-js opened 1 year ago

Femme-js commented 1 year ago

Model Name

Biomed Roberta Base

Model Description

BioMed-RoBERTa-base is a language model based on RoBERTa-base with biomedical domain-specific pretraining.

Slug

embeddings-extraction

Tag

Biomedical, Language Model

Publication

https://aclanthology.org/2020.acl-main.740/

Source Code

https://huggingface.co/allenai/biomed_roberta_base

License

Apache License 2.0

GemmaTuron commented 1 year ago

Here are a few PubChem BioAssays:

- AID450 - nice description
- AID1079828 - very short one
- AID884 - concise

Femme-js commented 1 year ago

Word embeddings derived from language models pre-trained on domain-specific corpora (biomedical articles) can help capture the additional intrinsic information contained in textual bioassay data.

What are Word Embeddings?

The simplest way to represent a word in numerical form is a one-hot encoding. However, this representation is very high-dimensional (the vector has the size of the vocabulary), and it neither provides much information about a word's meaning nor reveals any relationship between words.

Another representation is word embeddings: a low-dimensional space that compresses the high-dimensional (one-hot-like) vector into a dense vector, which can also capture relationships and similarity between words.
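
To make the contrast concrete, here is a toy sketch (the four-word vocabulary and the 3-dimensional embedding size are made up purely for illustration) comparing a one-hot vector with a learned dense embedding:

```python
# Toy sketch (hypothetical four-word vocabulary, not from the model) contrasting a
# one-hot vector with a learned dense embedding.
import torch

vocab = ["assay", "inhibitor", "cell", "growth"]        # toy vocabulary
one_hot = torch.eye(len(vocab))                         # each word is a |V|-dimensional sparse vector
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)  # learned dense vectors

word_id = vocab.index("inhibitor")
print(one_hot[word_id])                                 # tensor([0., 1., 0., 0.]) -- grows with the vocabulary
print(embedding(torch.tensor(word_id)).shape)           # torch.Size([3]) -- fixed, low-dimensional, learnable
```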

What is Biomed_Roberta_Base?

Language models like BERT, RoBERTa, DistilRoBERTa, etc. are transformer-based models that offer an advantage over models like Word2Vec (where each word has a fixed representation regardless of the context in which it appears): they produce word representations dynamically informed by the words around them.

The primary differences between these language models come from their architecture, the way pre-training is done, and the way input text is converted into tokens (drawn from a fixed vocabulary), which are then mapped to embedding vectors.
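
As a small illustration of that tokenization step (the input sentence below is a hypothetical bioassay-style description, not from PubChem), the Hugging Face tokenizer splits the text into sub-word tokens from the fixed vocabulary and maps them to integer ids:

```python
# Small sketch of how input text becomes fixed-vocabulary tokens and then ids;
# inside the model, each id is looked up in an embedding matrix.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")

text = "Counterscreen for inhibitors of bacterial growth."   # hypothetical bioassay description
tokens = tokenizer.tokenize(text)                # sub-word pieces from the fixed BPE vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)    # integer indices into that vocabulary
print(tokens)
print(ids)
```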

Biomed_Roberta_Base is adapted from RoBERTa-base via continued pretraining on 2.68 million scientific papers from the Semantic Scholar corpus.

This amounts to 7.55B tokens and 47GB of data.

Femme-js commented 1 year ago

How to utilize the Transformer Representations Efficiently?

The most common way of using transformer-based language models is for downstream tasks. However, because a Transformer is a multi-layer structure in which different layers capture different levels of representation, it learns a rich hierarchy of linguistic information:

Surface features in lower layers, syntactic features in middle layers, and semantic features in higher layers.

With Biomed_Roberta_Base, we aim to extract the embeddings from the hidden states of the model.
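
As a hedged sketch of what that could look like (one common heuristic for combining layer-wise hidden states, not necessarily the approach that will land in the final implementation; the input text is made up):

```python
# Sketch: request the hidden states of all layers and combine the last few into a
# sentence-level embedding. This is one common heuristic, shown only for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

name = "allenai/biomed_roberta_base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tokenizer("A luminescence-based cytotoxicity assay.", return_tensors="pt")  # hypothetical text
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embedding output + one tensor per layer

last4 = torch.stack(hidden_states[-4:])              # [4, batch, seq_len, hidden_size]
sentence_embedding = last4.mean(dim=2).mean(dim=0)   # mean over tokens, then over the 4 layers -> [batch, hidden_size]
print(sentence_embedding.shape)                      # torch.Size([1, 768])
```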

GemmaTuron commented 1 year ago

/approve

github-actions[bot] commented 1 year ago

New Model Repository Created! 🎉

@Femme-js ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos1086

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Femme-js commented 1 year ago

Hi @miquelduranfrigola and @GemmaTuron !

Before opening a PR, I am attaching the link to a Colab notebook as a guide to what the outputs look like, and I will add all relevant descriptions here in the issue.

https://colab.research.google.com/drive/12Q5w6yrWPO7BryW8YqWOla8FMRVt6jbC?usp=sharing

Femme-js commented 1 year ago

Guide through the steps taken for extracting the embeddings:

The Transformers package from Hugging Face provides a pool of pre-trained models for various tasks, which are easily customizable to our needs.

Hugging Face Transformer models provide two main outputs, and a third if configured accordingly.

pooler output (second output from the transformer): the last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function.

last hidden state (first and default output from the model)

This output is the sequence of hidden states at the output of the last layer. Its shape is usually [batch, maxlen, hidden_state]; it can be narrowed down to [batch, 1, hidden_state] for the [CLS] token, since the [CLS] token is the first token in the sequence. Here, [batch, 1, hidden_state] can equivalently be treated as [batch, hidden_state].

One way to summarize the last hidden state is through the CLS embedding. Transformers are contextual models, and the idea is that the [CLS] token captures the entire context and is sufficient for simple downstream tasks such as classification. Hence, for tasks that use sentence representations, you can work with [batch, hidden_state].

hidden states (n_layers, batch_size, seq_len, hidden_size): hidden states for all layers and for all token ids.
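
A minimal sketch tying the three outputs above to the standard Hugging Face API (my own illustration rather than the final Ersilia implementation; the input text is a made-up bioassay-style description):

```python
# Sketch of extracting the three outputs described above from allenai/biomed_roberta_base.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "allenai/biomed_roberta_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "Fluorescence-based assay for kinase inhibition."   # hypothetical bioassay description
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state   # [batch, maxlen, hidden_state]
cls_embedding = last_hidden_state[:, 0, :]      # first token (<s>, RoBERTa's [CLS] equivalent) -> [batch, hidden_state]
pooler_output = outputs.pooler_output           # first token passed through Linear + Tanh -> [batch, hidden_state]
hidden_states = outputs.hidden_states           # tuple over all layers, each [batch, maxlen, hidden_state]

print(cls_embedding.shape, pooler_output.shape, len(hidden_states))
```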

Femme-js commented 1 year ago

my.log

Hi @GemmaTuron and @miquelduranfrigola !

While testing this model locally using the --repo_path option, I am getting the error shown in the attached my.log. The checks give the same error. I have been looking into it but couldn't get anywhere; I would need some help here.

GemmaTuron commented 1 year ago

Hi @Femme-js, please explain what you have been checking so we have a better idea of how to guide you.

GemmaTuron commented 1 year ago

@Femme-js

Please see my comment from two weeks ago.

Femme-js commented 1 year ago

Hi @GemmaTuron! I have updated issue 603.

This model needs an IO class to be defined for textual input.