WalkerSue closed this issue 1 year ago
Thanks for the reply. I am using standalone_hyenadna.py for testing. My problem is that a batch of sequences is not padded to the maximum length. Do I need to pad it manually? My code:

    max_length = 32768

    tokenizer = hyenadna.CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters, N is uncertain
        model_max_length=max_length + 2,  # to account for special tokens, like EOS
        add_special_tokens=False,  # we handle special tokens elsewhere
        padding_side='left',  # since HyenaDNA is causal, we pad on the left
    )

    sequence = ['MS ACGTN', 'AAAAAAAAAAAAAAAAAAAA']
    tok_seq = tokenizer(sequence)
    tok_seq = tok_seq["input_ids"]  # grab ids
    tok_seq

Output (the two lists have different lengths, so no padding was applied):
[[0, 6, 6, 6, 7, 8, 9, 10, 11, 1], [0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1]]
manually
That works for me. Adding the padding parameter to the tokenizer call fixed it:

    tok_seq = tokenizer(sequence, padding='max_length')
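For anyone who would rather pad by hand (for example, only to the longest sequence in the batch instead of the full model_max_length), here is a minimal sketch in plain Python. The helper name left_pad is hypothetical, and the pad ID of 4 is an assumption about the CharacterTokenizer vocabulary in standalone_hyenadna.py; check tokenizer.pad_token_id before relying on it. The input IDs are the ones printed above.

```python
from typing import List, Optional


def left_pad(
    batch: List[List[int]],
    pad_id: int = 4,  # ASSUMPTION: [PAD] id; verify with tokenizer.pad_token_id
    length: Optional[int] = None,
) -> List[List[int]]:
    """Left-pad every id list in `batch` to a common length.

    If `length` is None, pad to the longest sequence in the batch.
    """
    if not batch:
        return []
    target = length if length is not None else max(len(ids) for ids in batch)
    padded = []
    for ids in batch:
        if len(ids) > target:
            raise ValueError(
                f"sequence of length {len(ids)} exceeds target length {target}"
            )
        # Pad tokens go in front because the model is causal.
        padded.append([pad_id] * (target - len(ids)) + ids)
    return padded


# Token ids from the unpadded output shown earlier in the thread.
tok_seq = [
    [0, 6, 6, 6, 7, 8, 9, 10, 11, 1],
    [0] + [7] * 20 + [1],
]
padded = left_pad(tok_seq)
```

After this, both rows of padded have the same length, so they can be stacked into a single tensor. Passing length=max_length + 2 instead would mimic padding to the tokenizer's model_max_length.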
You may need to describe the issue with more detail.