Closed by jfmao 10 months ago
Hi Jian-Feng,
Thanks for the great question. For now, you can convert any NumPy array of nucleotides (having dtype 'S1' in SeqPro) to a list of Python strings that can be used with HuggingFace tokenizers. Here's a gist to see how that works. To your point, I may add something along these lines as a convenience function in the future.
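For readers without access to the gist, here is a minimal sketch of the conversion described above. It is not the gist's code. The helper name `bytes_array_to_strings` is hypothetical, and the sketch assumes the sequences are stored as an array of single bytes (dtype 'S1') with shape (n_sequences, length), or as a single 1-D sequence.

```python
from typing import List

import numpy as np


def bytes_array_to_strings(seqs: np.ndarray) -> List[str]:
    """Convert an 'S1' array of nucleotides into a list of Python strings.

    Hypothetical helper for illustration; not part of SeqPro's API.
    """
    # Treat a single 1-D sequence as a batch of one.
    seqs = np.atleast_2d(seqs)
    # Each row is a run of single bytes; join them and decode to str.
    return [row.tobytes().decode("ascii") for row in seqs]


# Two toy sequences stored one nucleotide per element.
seqs = np.array([list("ACGT"), list("TTGA")], dtype="S1")
strings = bytes_array_to_strings(seqs)

# The resulting list can then be passed to a HuggingFace tokenizer, e.g.
# (assumption: `tokenizer` was loaded with AutoTokenizer.from_pretrained):
# encoded = tokenizer(strings)
```

Decoding row by row keeps the sketch simple. For very large batches, viewing each row as one fixed-width bytes field may be faster, but that requires a C-contiguous array.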
Cheers, David
Hi David,
Tons of thanks for your reply and guidance.
Best
Jian-Feng
Hi SeqPro team,
This is a great platform for DNA sequence modeling. I would like to know whether you plan to support Byte Pair Encoding (BPE) tokenization. It is becoming popular, as DNABERT-2 and some other projects are using it. Alternatively, do you have any thoughts on integrating this (https://github.com/aglabx/dnaBPE) into SeqPro, or integrating this (https://huggingface.co/dnagpt/human_gpt2-v1)?
Tons of thanks.
All the best