basehc opened this issue 1 year ago
Hi @basehc, I pointed this out in a closed issue as well. Let me know if this works for you: https://github.com/Zhihan1996/DNABERT_2/issues/26
Dear @akd13, thank you for the pointer. I actually checked your issue a while ago, before I opened this one, but I have no idea whether the following code works:
from transformers import AutoConfig, AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
# `sequence` is a DNA string defined elsewhere; repeating it 10x gives a long input
tokens = tokenizer(sequence*10, return_tensors='pt', padding='max_length', truncation=True, max_length=2000)

# Load the config first, raise the position-embedding limit, then load the model with it
config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
config.max_position_embeddings = 2000
dnabert = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", config=config, trust_remote_code=True)  # pretrained model

input_ids = tokens.input_ids
attention_mask = tokens.attention_mask
hidden_states = dnabert(input_ids=input_ids, attention_mask=attention_mask)
The above code can work. I may try it if I have time. However, according to the original paper and the author's comments, the DNABERT-2 tokenizer should handle sequences longer than 512 bp automatically. In the DNABERT-2 paper, the authors also report results on virus classification, where the sequences are roughly 1000 bp long, so I am confused. My guess is that the DNABERT-2 model loaded from Hugging Face may be based on the original BERT implementation, but I will try to check.
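For reference, one way to sanity-check the "handles >512 bp automatically" claim is to tokenize a ~1000 bp sequence and count the tokens it produces: DNABERT-2 uses BPE, which merges several nucleotides into one token, so a 1000 bp input usually stays well under the 512-token limit. A minimal sketch (not from this thread; the random-sequence helper is purely illustrative):

# Sketch: check how many tokens a ~1000 bp sequence actually produces after BPE tokenization
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

sequence = "".join(random.choice("ACGT") for _ in range(1000))  # ~1000 bp dummy sequence
tokens = tokenizer(sequence, return_tensors="pt")

# BPE compresses the input, so the token count is usually far below 512 for 1000 bp
print(len(sequence), tokens.input_ids.shape[1])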
You're right. If I add the trust_remote_code=True flag and change the config, it highly likely defaults to the original BERT model.
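One way to check which implementation is actually being used is to inspect the class and config of the loaded object. A sketch (not something posted in this thread), assuming the loaded config exposes max_position_embeddings:

# Sketch: verify whether the custom remote code or the built-in BERT class was loaded
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# If the remote code is used, the class/module come from the model repo,
# not from transformers' built-in modeling_bert
print(type(model).__name__, type(model).__module__)
print(config.model_type, getattr(config, "max_position_embeddings", None))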
Hi, I wonder whether this problem was resolved. It seems that if I pass sequences longer than 512 tokens as input, the model runs into a CUDA out-of-memory (OOM) error.
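As a general workaround (a sketch under the assumption that the OOM comes from activation memory at inference time; `sequences` is a hypothetical list of DNA strings, and tokenizer/dnabert are loaded as above), running the forward pass under torch.no_grad() and feeding small batches usually lowers the memory footprint considerably:

# Sketch: memory-friendlier inference for long inputs
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dnabert = dnabert.to(device).eval()

embeddings = []
with torch.no_grad():  # no autograd buffers -> much lower GPU memory
    for i in range(0, len(sequences), 4):  # small batch size; tune for your GPU
        batch = tokenizer(sequences[i:i + 4], return_tensors="pt",
                          padding=True, truncation=True, max_length=2000)
        batch = {k: v.to(device) for k, v in batch.items()}
        out = dnabert(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        # out[0] is the token-level hidden states in the DNABERT-2 README example
        embeddings.append(out[0].mean(dim=1).cpu())  # mean-pool over tokens

embeddings = torch.cat(embeddings, dim=0)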
Part 1: Tokenization and Dataset Preparation
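A possible sketch for this part, assuming a plain list of DNA sequences with labels; the SequenceDataset class below is hypothetical and not part of DNABERT-2:

# Hypothetical sketch of dataset preparation: a simple PyTorch wrapper around tokenized sequences
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

class SequenceDataset(Dataset):
    def __init__(self, sequences, labels, max_length=2000):
        # Tokenize everything up front; pads/truncates each sequence to max_length
        self.encodings = tokenizer(sequences, padding="max_length",
                                   truncation=True, max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item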
Part 2: Retrieving Model Configuration
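And a matching sketch for this part, mirroring the config change discussed earlier in the thread (retrieve the pretrained configuration, inspect it, then raise the position-embedding limit before loading the model):

# Sketch: retrieve and adjust the model configuration, then load the model with it
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
print(config)  # inspect defaults, e.g. max_position_embeddings

config.max_position_embeddings = 2000
dnabert = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M",
                                    config=config, trust_remote_code=True)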