GMOD / mimosa

Miniature Model Organism Sequence Aligner
http://gmod.github.com/mimosa/
Artistic License 2.0

Alphabet detection #123

Open leto opened 13 years ago

leto commented 13 years ago

Currently, when we autodetect a new sequence file, we assume it is in the nucleotide alphabet, which is not always correct. We need a function which, given a FASTA sequence file, does Deep Content Inspection to detect the alphabet. This can be done by ignoring all lines that begin with >, then treating the alphabet as protein if any characters other than ATCG exist, and as nucleotide otherwise.

Wrinkle: What about the N, X, and * characters?

cjfields commented 13 years ago

Raw DNA seq files could have ATGCNRYSWKMBDHV. See: http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
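
A minimal sketch of the detection discussed so far, in Python for illustration only. It is not code from mimosa or BioPerl. The names `NUCLEOTIDE_CODES` and `detect_alphabet` are hypothetical, and the nucleotide set is taken from the codes listed above. Non-letter characters such as `*` and `-` are ignored, which is an assumption rather than a settled answer to the wrinkle raised earlier.

```python
# Hypothetical helper, not part of mimosa or BioPerl.
# Nucleotide codes as listed in this thread (IUPAC ambiguity codes included).
NUCLEOTIDE_CODES = set("ATGCNRYSWKMBDHV")


def detect_alphabet(fasta_lines):
    """Guess the alphabet of FASTA content.

    Returns "nucleotide", "protein", or None when no sequence letters are found.
    """
    seen = set()
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        # Collect letters only; '*', '-', digits and whitespace are ignored.
        seen.update(ch for ch in line.upper() if ch.isalpha())
    if not seen:
        return None
    # Any letter outside the nucleotide codes is taken to mean protein.
    return "nucleotide" if seen <= NUCLEOTIDE_CODES else "protein"


if __name__ == "__main__":
    print(detect_alphabet([">seq1", "ATGCNNRYATGC"]))
    print(detect_alphabet([">prot1", "MKVLAAGIX*"]))
```

One known weakness of this approach: a short protein made up only of letters that are also nucleotide codes (for example `ACDGHK`) would be reported as nucleotide, so a real implementation may want a frequency threshold instead of a strict subset test.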

leto commented 13 years ago

Thanks @cjfields! Very helpful. Do you think alphabet detection should be in BioPerl itself? Seems like a basic thing that many people could benefit from.

cjfields commented 13 years ago

@leto, it is present in Bio::PrimarySeq::validate_seq(). Bio::Seq delegates to Bio::PrimarySeq, and the sequence is checked whenever seq() is set. We should probably yank that code out into something exportable.