MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

Quickstart Does not work and Embedding Dim is not 768 #75

Closed Leo-T-Zang closed 1 month ago

Leo-T-Zang commented 3 months ago

Hi DNABERT team,

The provided Quick Start code does not work; it fails with the error below.

import torch
from transformers import AutoTokenizer, AutoModel

# pick a device; the original snippet used `device` without defining it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True).to(device)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"].to(device)
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect torch.Size([768])

Error log:

Traceback (most recent call last):
  File "/workspace/work/CLIP/DNA/DNA_emb.py", line 22, in <module>
    model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    cls.register(config.__class__, model_class, exist_ok=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 587, in register
    raise ValueError(
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers.models.bert.configuration_bert.BertConfig'> and you passed <class 'transformers_modules.zhihan1996.DNABERT-2-117M.dd10f74f0e90735d02a27603e56467761893e8f9.configuration_bert.BertConfig'>. Fix one of those so they match!

I managed to make it run by using BertConfig directly, as below:

from transformers import AutoTokenizer, AutoModelForMaskedLM, BertConfig

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
model = AutoModelForMaskedLM.from_config(config).to(device)  # note: from_config initializes random weights; it does not load the checkpoint

Yet, the output embedding dimension is 4096 instead of 768.
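For reference, 4096 happens to match DNABERT-2's BPE vocabulary size, so the masked-LM head may be returning per-token vocabulary logits rather than 768-dim hidden states. A minimal check of the two config fields (a sketch; the commented values are what I'd expect from the published config):

from transformers import BertConfig

config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
print(config.hidden_size)  # 768  -> width of the hidden states
print(config.vocab_size)   # 4096 -> width of the masked-LM logits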

Could you help me out? Thanks a lot.

Zhihan1996 commented 1 month ago

Sorry for this super late reply. It looks like an issue with the transformers version. Please try pip install transformers==4.28.0
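With that version pinned, the original quickstart should run unchanged. A quick sanity check (a sketch; 768 is the model's hidden size, not the 4096 vocabulary size):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

inputs = tokenizer("ACGTAGCATCGGATCTATCTATCG", return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]   # [1, sequence_length, 768]
print(hidden_states.shape[-1])     # expect 768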