frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks

nucleotide transformer embedding #59

Closed hongruhu closed 1 month ago

hongruhu commented 1 month ago

Hi, thanks for providing the tools to facilitate embedding extraction with different gLMs. However, I have a question about the Nucleotide Transformer embedder function.

For example, when I directly used the script from Hugging Face (https://huggingface.co/InstaDeepAI/nucleotide-transformer-500m-human-ref/blob/main/README.md) with the 500m_human_ref model on example sequences of length 18, 33, and 78, respectively (shown below):

sequences = [
    "ATTCCGATTCCGATTCCG",
    "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT",
    "ATTCCGAAATCGCTGACCGATCGTACGAAAATTTCTCTCTCTCTCTGAGATCGATCGATCGATATCTCTCGAGCTAGC",
]

it returned embeddings with shape (3, 1000, 1280).
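For reference, this is roughly the approach in the linked README (a sketch, not BEND code; it assumes the 1000-token context window that matches the shape above):

# Sketch of the Hugging Face README approach (not BEND code); assumes the
# model's 1000-token context window, consistent with the shape reported above.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Every sequence is padded to the full 1000-token window.
tokens = tokenizer.batch_encode_plus(
    sequences, return_tensors="pt", padding="max_length", max_length=1000
)
with torch.no_grad():
    outs = model(
        tokens["input_ids"],
        attention_mask=tokens["attention_mask"],
        output_hidden_states=True,
    )
embeddings_hf = outs.hidden_states[-1]  # last hidden state, shape (3, 1000, 1280)
print(embeddings_hf.shape)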

But when I used the code provided by BEND:

from bend.embedders import NucleotideTransformerEmbedder

# Load the embedder with a valid checkpoint name or path
embedder = NucleotideTransformerEmbedder('InstaDeepAI/nucleotide-transformer-500m-human-ref')

# Embed a list of sequences
embeddings = embedder.embed(sequences)
for e in embeddings:
    print(e.shape)
# (1, 3, 1280)
# (1, 8, 1280)
# (1, 13, 1280)

In both cases, max_tokens = 1000 by default. I was wondering whether the discrepancy in dimension 1 is due to chunking or to the layer used (last hidden state vs. hidden states). Could the authors clarify the reason for the different implementation, and how to obtain the mean embeddings?
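For concreteness, one way to get mean embeddings from the BEND output above (a sketch with plain numpy, not a BEND function; it assumes "mean embedding" means averaging over the token dimension):

# A sketch: average each sequence's per-token vectors over the token dimension.
# `embeddings` is the list returned by the BEND snippet above.
import numpy as np

mean_embeddings = np.stack([e.mean(axis=1).squeeze(0) for e in embeddings])
print(mean_embeddings.shape)  # (3, 1280)

With the padded Hugging Face output, one would additionally mask out the padding positions before averaging.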

hongruhu commented 1 month ago

Oh, I see the issue: the setting padding="max_length" is not used within the embed function, so the sequences are not padded out to the full token length.
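A quick way to see the effect with the plain tokenizer (a sketch, not the BEND embedder):

# Tokenize the same sequences with and without padding="max_length".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-human-ref"
)

unpadded = tokenizer(sequences)["input_ids"]
print([len(ids) for ids in unpadded])  # a handful of tokens per sequence

padded = tokenizer(sequences, padding="max_length", max_length=1000)["input_ids"]
print([len(ids) for ids in padded])  # [1000, 1000, 1000]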