ML4GLand / SeqPro

Genomic sequence preprocessing toolkit
MIT License

Do you have plans to incorporate a Byte Pair Encoding (BPE) tokenizer? #9

Closed: jfmao closed this issue 10 months ago

jfmao commented 10 months ago

Hi SeqPro team,

This is a great platform for DNA sequence modeling. I would like to know whether you plan to support Byte Pair Encoding (BPE) tokenization. It is becoming popular, as DNABERT-2 and some other projects use it. Do you have any thoughts on integrating dnaBPE (https://github.com/aglabx/dnaBPE) into SeqPro, or this Hugging Face model (https://huggingface.co/dnagpt/human_gpt2-v1)?

Tons of thanks.

All the best

d-laub commented 10 months ago

Hi Jian-Feng,

Thanks for the great question. For now, you can convert any NumPy array of nucleotides (having dtype 'S1' in SeqPro) to a list of Python strings that can be used with HuggingFace tokenizers. Here's a gist to see how that works. To your point, I may add something along these lines as a convenience function in the future.
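For readers without access to the gist, here is a minimal sketch of the conversion described above. It is not the linked gist and not part of SeqPro's API: the function name `nucleotides_to_strings` is hypothetical, and the code assumes only that sequences are stored as a NumPy array of dtype 'S1' whose last axis is sequence length.

```python
import numpy as np


def nucleotides_to_strings(seqs: np.ndarray) -> list[str]:
    """Convert an 'S1' array of shape (..., length) to a list of Python strings.

    Hypothetical helper for illustration; not a SeqPro function.
    """
    if seqs.dtype != np.dtype("S1"):
        raise TypeError(f"expected dtype 'S1', got {seqs.dtype}")
    if seqs.ndim == 0:
        raise ValueError("expected at least a 1-D array of nucleotides")

    # Flatten all leading axes so that each row is one sequence.
    rows = seqs.reshape(-1, seqs.shape[-1])

    # Each element is a single byte, so a row's raw bytes are the sequence.
    return [row.tobytes().decode("ascii") for row in rows]


if __name__ == "__main__":
    # Two sequences of length 4, stored one nucleotide per element.
    batch = np.frombuffer(b"ACGTTTGA", dtype="S1").reshape(2, 4)
    print(nucleotides_to_strings(batch))
```

The resulting list of strings can then be passed to a Hugging Face tokenizer, for example one loaded with `AutoTokenizer.from_pretrained(...)` from the `transformers` package; which checkpoint to load depends on the model you want to use.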

Cheers, David

jfmao commented 10 months ago

Hi David,

Tons of thanks for your reply and guidance.

Best

Jian-Feng