Open leto opened 13 years ago
Raw DNA seq files could have ATGCNRYSWKMBDHV. See: http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
Thanks @cjfields! Very helpful. Do you think alphabet detection should be in BioPerl itself? Seems like a basic thing that many people could benefit from.
@leto, it is present in Bio::PrimarySeq::validate_seq(). Bio::Seq delegates to Bio::PrimarySeq, and the sequence is checked whenever seq() is set. We should probably yank that code out into something exportable.
Currently, when we autodetect a new sequence file, we assume it is in the nucleotide alphabet, which is wrong for protein files. We need a function which, given a FASTA sequence file, does deep content inspection to detect the alphabet. This can be done by ignoring all lines that begin with >, then treating the alphabet as protein if characters other than ATCG exist.
Wrinkle: what about the N, X, and * characters?
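A rough sketch of how that detection could look, written in Python only for illustration (it is not the Bio::PrimarySeq code). The function name `guess_alphabet`, the 0.9 cutoff, and the choice to skip `*` and gap characters are all assumptions, not anything decided in this thread. It ignores header lines, calls the file protein as soon as it sees a letter outside the IUPAC nucleotide codes mentioned above (which covers X), and otherwise requires that most residues be plain A/C/G/T/U/N, so that a short peptide made only of letters like M, K, V is not mistaken for DNA.

```python
from typing import Iterable

# Unambiguous bases plus N ("any base"); U is included so RNA also counts.
CORE_BASES = frozenset("ACGTUN")
# IUPAC nucleotide codes (ATGCNRYSWKMBDHV from the comment above, plus U).
IUPAC_NUCLEOTIDE = frozenset("ACGTUNRYSWKMBDHV")


def guess_alphabet(lines: Iterable[str], min_core_fraction: float = 0.9) -> str:
    """Return 'nucleotide', 'protein', or 'unknown' for FASTA-formatted lines.

    min_core_fraction is an assumed cutoff, not a value taken from BioPerl.
    """
    core = 0
    total = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        for ch in line.upper():
            # Non-letters (*, -, ., digits, whitespace) carry no signal here.
            if not ch.isalpha():
                continue
            # Letters such as E, F, I, L, P, Q, X never occur in nucleotide data.
            if ch not in IUPAC_NUCLEOTIDE:
                return "protein"
            total += 1
            if ch in CORE_BASES:
                core += 1
    if total == 0:
        return "unknown"
    # Only IUPAC letters were seen; require that most are plain bases.
    return "nucleotide" if core / total >= min_core_fraction else "protein"
```

Passing an open file handle works, since iterating over it yields lines: `guess_alphabet(open(path))`. Under these assumptions N counts toward nucleotide, X forces protein, and * is ignored, which is one possible answer to the wrinkle rather than the only one.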