HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

Tokenizer does not pad sequences to the same length automatically #16

Closed: WalkerSue closed this issue 1 year ago

exnx commented 1 year ago

You may need to describe the issue in more detail.

WalkerSue commented 1 year ago

Thanks for the reply! I am using standalone_hyenadna.py for testing. My problem is that a batch of sequences is not padded to the maximum length. Do I need to pad it manually? Code:

```python
max_length = 32768
tokenizer = hyenadna.CharacterTokenizer(
    characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters, N is uncertain
    model_max_length=max_length + 2,  # to account for special tokens, like EOS
    add_special_tokens=False,  # we handle special tokens elsewhere
    padding_side='left',  # since HyenaDNA is causal, we pad on the left
)
sequence = ['MS ACGTN', 'AAAAAAAAAAAAAAAAAAAA']
tok_seq = tokenizer(sequence)
tok_seq = tok_seq["input_ids"]  # grab ids
tok_seq
```

Output:

[[0, 6, 6, 6, 7, 8, 9, 10, 11, 1], [0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1]]
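If you do want to pad by hand, here is a minimal sketch of left-padding the id lists above (this assumes `tok_seq` is the list of id lists shown in the output and that the tokenizer exposes the usual Hugging Face `pad_token_id` attribute):

```python
import torch

pad_id = tokenizer.pad_token_id                 # assumes the standard HF pad token is set
target_len = max(len(ids) for ids in tok_seq)   # or a fixed length such as max_length + 2

# Pad on the left, since HyenaDNA is causal and the tokenizer was built with padding_side='left'
padded = [[pad_id] * (target_len - len(ids)) + ids for ids in tok_seq]
batch = torch.LongTensor(padded)                # shape: (batch_size, target_len)
```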

WalkerSue commented 1 year ago

Padding manually works, but it turns out I just needed to add a parameter; this does it for me:

`tok_seq = tokenizer(sequence, padding='max_length')`
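For completeness, a sketch of the fully padded call (assuming CharacterTokenizer inherits the standard `padding` / `truncation` / `return_tensors` options from Hugging Face's `PreTrainedTokenizer`, which the standalone tokenizer subclasses):

```python
tok_seq = tokenizer(
    sequence,
    padding='max_length',   # pad every example up to model_max_length
    truncation=True,        # clip anything longer than model_max_length
    return_tensors='pt',    # return a ready-to-use LongTensor batch
)
input_ids = tok_seq['input_ids']  # shape: (batch_size, model_max_length)
```

If padding every example to the full 32k context is more than you need, `padding=True` (pad to the longest sequence in the batch) is the usual Hugging Face alternative.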
