MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome

Despite multiple trials and examining the model configuration, it seems that the model hosted on Hugging Face (`huggingface.co`) cannot handle sequences that exceed a length of 512 tokens. I've provided the relevant code below for clarity #30

Open basehc opened 1 year ago

basehc commented 1 year ago

Part 1: Tokenization and Dataset Preparation

from transformers import AutoTokenizer, BertForSequenceClassification
from torch.utils.data import Dataset

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = BertForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", num_labels=8)

class DNADataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq, label = self.data[idx]
        inputs = self.tokenizer(seq, return_tensors='pt', padding='max_length', max_length=600, truncation=True)
        return {
            'input_ids': inputs["input_ids"].squeeze(),
            'label': label
        }
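
For reference, a minimal sketch of wiring the dataset above into a DataLoader; the toy `data` list of (sequence, label) pairs is made up for illustration.

from torch.utils.data import DataLoader

# Hypothetical toy data: (DNA sequence, integer class label) pairs.
data = [("ACGTACGTACGT" * 60, 0), ("TTGACCATGGCA" * 60, 3)]

dataset = DNADataset(data, tokenizer)
loader = DataLoader(dataset, batch_size=2)

batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 600]), given padding='max_length', max_length=600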

Part 2: Retrieving Model Configuration

from transformers import AutoConfig

config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
print(config.max_position_embeddings)
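
For completeness, the config object can also be inspected for what the checkpoint declares; which of these fields are actually present depends on the config.json shipped with the checkpoint.

# Sketch: see which config class was instantiated and what it declares.
print(type(config).__name__)
print(getattr(config, "model_type", None))
print(getattr(config, "architectures", None))
print(getattr(config, "max_position_embeddings", None))
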
akd13 commented 1 year ago

Hi @basehc, I pointed this out in a closed issue as well. Let me know if this works for you - https://github.com/Zhihan1996/DNABERT_2/issues/26

basehc commented 1 year ago

Dear @akd13, thanks for the pointer. Actually, I had already read your issue long before I opened this one, but I am not sure whether the following code actually works:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
# `sequence` is a DNA string defined elsewhere; repeating it makes the input long.
tokens = tokenizer(sequence*10, return_tensors='pt', padding='max_length', truncation=True, max_length=2000)

dnabert = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M")  # pretrained model
config = dnabert.config
config.max_position_embeddings = 2000
dnabert = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", config=config)  # reload with the larger limit

input_ids = tokens.input_ids
attention_mask = tokens.attention_mask
token_type_ids = tokens.token_type_ids
hidden_states = dnabert(input_ids, attention_mask=attention_mask)

I may try it if I have time. However, according to the original paper and the author's comments, the DNABERT-2 tokenizer should handle sequences longer than 512 tokens automatically, and the experiments in the DNABERT-2 paper include virus classification with sequences of about 1,000 bp, so I am confused. My guess is that the DNABERT-2 model loaded from Hugging Face may be based on the original BERT model, but I will try to check.
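
Something like the sketch below is what I have in mind for the check (not verified; the expected outputs in the comments are my guess):

from transformers import AutoModel, BertForSequenceClassification

# Load the checkpoint both ways and compare which classes come back.
custom = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
vanilla = BertForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", num_labels=8)

# Expectation: the first points at the repo's downloaded remote-code module,
# the second at transformers.models.bert.modeling_bert.
print(type(custom).__module__, type(custom).__name__)
print(type(vanilla).__module__, type(vanilla).__name__)

# Stock BERT keeps a learned position-embedding table, which caps usable input
# length at whatever max_position_embeddings resolves to.
print(vanilla.bert.embeddings.position_embeddings)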

akd13 commented 1 year ago

You're right. If I add the trust_remote_code=True flag and change the config, it most likely still defaults to the original BERT model.
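
One way to sanity-check this might be to push a clearly >512-token input through the model loaded via AutoModel with trust_remote_code=True and see whether the forward pass goes through (this assumes the remote code has no fixed position-embedding table, which I have not confirmed):

import random
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

dna = "".join(random.choice("ACGT") for _ in range(3000))  # toy 3,000 bp sequence
inputs = tokenizer(dna, return_tensors="pt")
print(inputs["input_ids"].shape)   # confirm the tokenized length is well past 512

with torch.no_grad():
    hidden_states = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])[0]
print(hidden_states.shape)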

HelloWorldLTY commented 1 month ago

Hi, I wonder whether this problem has been resolved. It seems that if I pass sequences longer than 512 tokens as input, the model runs into a CUDA OOM error.
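
In case it is a memory issue rather than a hard length limit, these are the knobs I would try first when fine-tuning (standard Trainer options, just a sketch):

from transformers import TrainingArguments

# Sketch of the usual memory knobs when fine-tuning with the HF Trainer:
# a small per-device batch, gradient accumulation to keep the effective batch
# size, and fp16. For pure inference, wrapping the forward pass in
# torch.no_grad() already cuts activation memory considerably.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    fp16=True,
)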