Hi, thanks for providing the tools to facilitate embedding extraction with different gLMs. However, I have a question about the Nucleotide Transformer embedder function.
```python
from bend.embedders import NucleotideTransformerEmbedder

# Load the embedder with a valid checkpoint name or path
embedder = NucleotideTransformerEmbedder('InstaDeepAI/nucleotide-transformer-500m-human-ref')

# Embed a list of sequences
embeddings = embedder.embed(sequences)
for e in embeddings:
    print(e.shape)
# (1, 3, 1280)
# (1, 8, 1280)
# (1, 13, 1280)
```
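For what it's worth, my current guess is that dimension 1 is simply the unpadded token count per sequence. Assuming the NT tokenizer emits non-overlapping 6-mers and falls back to single-nucleotide tokens for the remainder (an assumption on my part, not verified against the tokenizer code), the counts line up with the shapes above:

```python
# Sketch of my assumption about where dim 1 comes from: non-overlapping
# 6-mer tokens plus single-nucleotide tokens for any leftover bases.
def assumed_token_count(seq_len: int) -> int:
    kmers, remainder = divmod(seq_len, 6)
    return kmers + remainder  # 6-mer tokens + single-nucleotide leftovers

for n in (18, 33, 78):
    print(n, '->', assumed_token_count(n))
# 18 -> 3, 33 -> 8, 78 -> 13, matching the BEND shapes (no padding),
# whereas the Hugging Face script pads every sequence to max_tokens=1000.
```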
For example, when I directly used the example script from Hugging Face (https://huggingface.co/InstaDeepAI/nucleotide-transformer-500m-human-ref/blob/main/README.md) with the `500m_human_ref` model on example sequences of lengths 18, 33, and 78 respectively, it returned:

Embeddings shape: (3, 1000, 1280)
But when I used the code provided by BEND (shown above), the shapes were (1, 3, 1280), (1, 8, 1280), and (1, 13, 1280) instead.
In both cases, `max_tokens = 1000` by default. I was wondering whether the discrepancy in dimension 1 is due to the chunking or to the layer used (last hidden state vs. hidden states)? Could the authors clarify the reason for the different implementations, and how to obtain the mean embeddings?
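In case it clarifies what I'm after, this is what I mean by mean embeddings. It is a sketch with dummy arrays standing in for the real model outputs (shapes taken from my runs above); the masking logic is my assumption about how the padded Hugging Face output should be pooled, not something I found in either codebase:

```python
import numpy as np

# Dummy stand-ins for the BEND embedder output: one array of shape
# (1, n_tokens, 1280) per sequence, with no padding.
bend_embeddings = [np.random.rand(1, n, 1280) for n in (3, 8, 13)]

# Per-sequence mean over the token dimension -> one (1280,) vector each.
bend_means = [e.mean(axis=1).squeeze(0) for e in bend_embeddings]

# Dummy stand-in for the Hugging Face script output: one batch padded
# to max_tokens=1000, shape (3, 1000, 1280).
hf_hidden = np.random.rand(3, 1000, 1280)

# Assumed attention mask: 1 for real tokens, 0 for padding.
hf_mask = np.zeros((3, 1000))
for i, n in enumerate((3, 8, 13)):
    hf_mask[i, :n] = 1

# Masked mean: sum only the real-token embeddings and divide by the
# number of real tokens, so padding does not dilute the average.
hf_means = (hf_hidden * hf_mask[:, :, None]).sum(axis=1) / hf_mask.sum(
    axis=1, keepdims=True
)

print([m.shape for m in bend_means])  # [(1280,), (1280,), (1280,)]
print(hf_means.shape)                 # (3, 1280)
```

My main uncertainty is whether a plain mean over BEND's unpadded output and this masked mean over the padded Hugging Face output are supposed to give the same vectors.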