BioJulia / BioSequences.jl

Biological sequences for the Julia language
http://biojulia.dev/BioSequences.jl
MIT License

One-hot encoding of sequences #130

Closed: cossio closed this issue 3 years ago

cossio commented 3 years ago

This is a feature request.

Machine learning models of biological sequences often take their input in one-hot encoding. I think it would be nice to add support for one-hot encoding and decoding of biological sequences in this package.

jakobnissen commented 3 years ago

Hello @cossio

I think that's outside the scope of BioSequences. One-hot encoding pertains to the specific machine learning model that consumes the sequences, not to the sequences themselves. We can't predict what kind of embedding or transformation machine learning people will want to apply to biological sequences, so that is better left to the people who need it. For example: should it include Ns or other ambiguous nucleotides? Should it be a BitMatrix or an integer matrix? Should amino acids use a reduced alphabet?
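To make those open choices concrete, here is a minimal sketch in plain Julia. It is only an illustration: `onehot_dna` and its keywords `T` and `ambiguous` are hypothetical names, not part of BioSequences.jl, and it works on ordinary strings over a fixed ACGT alphabet rather than on BioSequences types.

```julia
# Hypothetical helper, not part of BioSequences.jl.
# Encodes a DNA string as a 4 x L matrix with rows A, C, G, T.
# `T` picks the element type (Bool, Int, Float32, ...).
# `ambiguous` decides what happens to symbols outside ACGT, such as N:
#   :error   -> throw an ArgumentError
#   :zeros   -> leave the column all zero
#   :uniform -> put 1/4 in every row (needs a floating-point T)
function onehot_dna(seq::AbstractString; T::Type=Bool, ambiguous::Symbol=:error)
    alphabet = ['A', 'C', 'G', 'T']
    out = zeros(T, length(alphabet), length(seq))
    for (col, ch) in enumerate(uppercase(seq))
        row = findfirst(==(ch), alphabet)
        if row !== nothing
            out[row, col] = one(T)
        elseif ambiguous === :uniform
            out[:, col] .= one(T) / length(alphabet)
        elseif ambiguous === :error
            throw(ArgumentError("symbol $ch at position $col is not in ACGT"))
        elseif ambiguous !== :zeros
            throw(ArgumentError("unknown ambiguous mode $ambiguous"))
        end
    end
    return out
end

strict  = onehot_dna("ACGT")                                   # Matrix{Bool}
masked  = onehot_dna("ACNT"; ambiguous=:zeros)                 # N column is all false
uniform = onehot_dna("ACNT"; T=Float32, ambiguous=:uniform)    # N column is all 0.25
```

Each keyword corresponds to one of the questions above, and different models will want different answers, which is why a single built-in encoder would be hard to get right.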

jakobnissen commented 3 years ago

I'm not aware of any Julia package using machine learning on biological sequences. But it's fairly easy to add a function that one-hot encodes sequences. Here's one:

    using BioSequences

    # One row per alphabet symbol, one column per sequence position.
    onehot(s::BioSequence{A}) where A =
        reduce(vcat, [reshape(s .== i, 1, :) for i in A()])

Not the most efficient, but probably good enough for most use cases.
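Since the request also mentions decoding, here is a rough sketch of the inverse direction in plain Julia. `onehot_decode` is a hypothetical name, not part of BioSequences.jl. It assumes the layout the `onehot` one-liner produces (one row per symbol, one column per position) and that the caller passes the symbols in the same order as the rows.

```julia
# Hypothetical helper, not part of BioSequences.jl.
# Inverts a one-hot matrix whose rows follow `alphabet`:
# one row per symbol, one column per sequence position.
function onehot_decode(m::AbstractMatrix{Bool}, alphabet::AbstractVector)
    size(m, 1) == length(alphabet) ||
        throw(DimensionMismatch("expected one row per alphabet symbol"))
    map(eachcol(m)) do col
        # Reject columns with zero or several set entries.
        count(col) == 1 || throw(ArgumentError("column is not one-hot"))
        alphabet[findfirst(col)]
    end
end

# Rows are A, C, G, T; the columns encode A, G, C.
m = Bool[1 0 0;
         0 0 1;
         0 1 0;
         0 0 0]
decoded = onehot_decode(m, ['A', 'C', 'G', 'T'])
```

The result is a vector of symbols. Turning it back into a sequence object, and deciding what to do with all-zero or multi-hot columns, again depends on the encoding conventions the model used.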