Femme-js opened this issue 1 year ago
Here are a few PubChem BioAssays:
- AID450 - nice description
- AID1079828 - very short one
- AID884 - concise
Word embeddings derived from language models pre-trained on domain-specific corpora (biomedical articles) can help capture the additional intrinsic information carried in textual bioassay data.
The simplest way to represent a word in numerical form is a one-hot encoding. However, this representation is very high-dimensional (the vector is the size of the vocabulary), and it neither provides much information about a word's meaning nor reveals any existing relationships between words.
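To make the dimensionality problem concrete, here is a minimal one-hot sketch (the tiny vocabulary is invented for illustration; a real vocabulary has tens of thousands of entries):

```python
# Minimal one-hot encoding sketch: each word becomes a vector as long as
# the vocabulary, with a single 1 at the word's index.
vocab = ["assay", "protein", "inhibitor", "cell", "compound"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vocabulary-sized vector with a single 1 for the given word."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("protein"))  # [0, 1, 0, 0, 0]
```

Note that "protein" and "inhibitor" are exactly as far apart as "protein" and "cell" under this scheme, which is why one-hot vectors reveal nothing about word relationships.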
Another representation is word embeddings: a low-dimensional space that compresses a high-dimensional vector (like a one-hot encoding) into a dense vector, which can also capture relationships and similarity between words.
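The "relationships and similarity" point can be sketched with toy dense vectors (the 3-dimensional values below are invented; learned embeddings typically have hundreds of dimensions) compared via cosine similarity:

```python
import math

# Toy 3-dimensional embeddings (values invented for illustration only).
embeddings = {
    "protein": [0.9, 0.1, 0.3],
    "enzyme":  [0.8, 0.2, 0.4],
    "car":     [0.1, 0.9, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Related words end up closer in the embedding space than unrelated ones.
print(cosine_similarity(embeddings["protein"], embeddings["enzyme"]))  # high
print(cosine_similarity(embeddings["protein"], embeddings["car"]))     # low
```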
Language models like BERT, RoBERTa, DistilRoBERTa, etc., are transformer-based models offering an advantage over models like Word2Vec (where each word has a fixed representation regardless of the context in which it appears): they produce word representations that are dynamically informed by the words around them.
The primary differences between these language models originate from their architecture design, the way pre-training is done, and the way input text is converted into tokens (drawn from a fixed vocabulary), which are then converted into embedding vectors.
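A quick tokenization sketch; this loads the `roberta-base` tokenizer on the assumption that `biomed_roberta_base`, as a continued-pretraining adaptation of RoBERTa-base, shares its byte-level BPE vocabulary (the sample sentence is made up):

```python
from transformers import AutoTokenizer

# Assumption: biomed_roberta_base reuses the roberta-base BPE vocabulary,
# so the lighter roberta-base tokenizer is loaded here for illustration.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Luciferase-based bioassay for kinase inhibition"
tokens = tokenizer.tokenize(text)  # subword pieces from the fixed vocabulary
ids = tokenizer.encode(text)       # adds the special <s> and </s> tokens

print(tokens)
print(ids)
```

Out-of-vocabulary biomedical terms get split into several subword pieces, which the model then maps to embedding vectors.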
BioMed-RoBERTa-base is adapted from RoBERTa-base via continued pretraining on 2.68 million scientific papers from the Semantic Scholar corpus.
This amounts to 7.55B tokens and 47GB of data.
The most common way of using transformer-based language models is for downstream tasks. However, because a Transformer is a multi-layer structure that captures different levels of representation in different layers, it learns a rich hierarchy of linguistic information.
Surface features appear in lower layers, syntactic features in middle layers, and semantic features in higher layers.
With Biomed_Roberta_Base, we aim to extract the embeddings from the hidden states of the model.
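A minimal sketch of that extraction, assuming the `allenai/biomed_roberta_base` checkpoint on the Hugging Face Hub and an invented example sentence (this mirrors the general `AutoModel` pattern, not the final Ersilia implementation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "allenai/biomed_roberta_base"  # checkpoint discussed in this issue
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

text = "This assay measures inhibition of kinase activity."  # made-up example
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] embedding from the last hidden state:
# [batch, maxlen, hidden_state] -> [batch, hidden_state]
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```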
/approve
@Femme-js ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
Update the README.md file to accurately describe your model.

If you have any questions, please feel free to open an issue and get support from the community!
Hi @miquelduranfrigola and @GemmaTuron !
Before opening a PR, I am attaching the link to a Colab notebook to show what the outputs look like, and I will add all relevant descriptions here in the issue.
https://colab.research.google.com/drive/12Q5w6yrWPO7BryW8YqWOla8FMRVt6jbC?usp=sharing
Guide through the steps taken for extracting the embeddings:
WHY PYTORCH? The PyTorch interface is selected as it strikes a nice balance between flexibility and easy-to-use high-level APIs.
USING THE HUGGING FACE TRANSFORMERS LIBRARY

Hugging Face focuses on Natural Language Processing (NLP) tasks, where the idea is not just to recognize words but to understand their meaning and context. Companies like Hugging Face provide tooling that makes working with such models straightforward.
The Transformers package that Hugging Face provides offers a pool of pre-trained models for performing various tasks, which are easily customizable to our needs.
Hugging Face Transformers models provide two main outputs, and three if configured accordingly.
• pooler output (second output from the model): the last-layer hidden state of the first token of the sequence (the classification token), further processed by a linear layer and a tanh activation function.
• last hidden state (first and default output from the model): the sequence of hidden states at the output of the last layer. The output shape is usually [batch, maxlen, hidden_state]. It can be narrowed down to [batch, 1, hidden_state] for the [CLS] token, as [CLS] is the first token in the sequence. Here, [batch, 1, hidden_state] can be equivalently considered as [batch, hidden_state].
One way to represent the last hidden state is through the [CLS] embedding. Transformers are contextual models, and the idea is that the [CLS] token will have captured the entire context, which is sufficient for simple downstream tasks such as classification. Hence, for tasks that use sentence representations, you can use [batch, hidden_state].
• hidden states (n_layers, batch_size, seq_len, hidden_size): hidden states for all layers and for all token ids.
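The three outputs above can be sketched with a small randomly initialised RoBERTa model (the tiny config sizes are invented so no checkpoint download is needed; a real checkpoint behaves the same way with hidden_size=768 and 12 layers):

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Toy config (invented sizes) just to demonstrate the output structure.
config = RobertaConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=64,
)
model = RobertaModel(config)  # random weights, structure only
model.eval()

input_ids = torch.tensor([[0, 5, 6, 7, 2]])  # [batch=1, seq_len=5]
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

print(out.last_hidden_state.shape)  # [batch, seq_len, hidden] = [1, 5, 32]
print(out.pooler_output.shape)      # [batch, hidden] = [1, 32]
print(len(out.hidden_states))       # embedding layer + one per layer = 5
```

Passing `output_hidden_states=True` is what exposes the third output (per-layer hidden states) described in the last bullet.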
Hi @GemmaTuron and @miquelduranfrigola !
While testing this model locally using the --repo_path flag, I am getting the following error. The same error is given by the checks too. I have been looking into this error but couldn't get anywhere. I would need some help here.
Hi @Femme-js, please explain what you have been checking so we have a better idea of how to guide you.
@Femme-js
Please see my comment from two weeks ago.
Model Name
Biomed Roberta Base
Model Description
BioMed-RoBERTa-base is a language model based on RoBERTa-base with biomedical domain-specific pretraining.
Slug
embeddings-extraction
Tag
Biomedical, Language Model
Publication
https://aclanthology.org/2020.acl-main.740/
Source Code
https://huggingface.co/allenai/biomed_roberta_base
License
Apache License 2.0