Closed by simfor 12 years ago
For now, the multipleAlign subroutine returns FASTA output as a single string. This string can be read directly into a SeqIO object, as is done, for example, in the findGaps subroutine.
We can look for all the oddities mentioned above simply by using regular expressions. This means we won't need a multiple alignment parser. Hurray!
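As a rough illustration of the regex idea, here is a minimal sketch in Python (the project itself appears to use Perl, so treat this as language-neutral pseudocode that happens to run). The names `split_fasta` and `long_gap_runs` and the `min_run` threshold are made up for this example and are not part of the codebase. One regex splits the FASTA string into records; a second one reports long runs of gap characters.

```python
import re

# Hypothetical helper, not from the project: one FASTA record is a ">" header
# line followed by everything up to the next ">".
FASTA_RECORD = re.compile(r"^>(\S+)[^\n]*\n([^>]*)", re.MULTILINE)


def split_fasta(fasta_text):
    """Split a FASTA-formatted string into a list of (id, sequence) pairs."""
    records = []
    for match in FASTA_RECORD.finditer(fasta_text):
        seq_id = match.group(1)
        # Join wrapped sequence lines by dropping all whitespace.
        sequence = re.sub(r"\s+", "", match.group(2))
        records.append((seq_id, sequence))
    return records


def long_gap_runs(sequence, min_run=10):
    """Return (start, length) for every run of at least min_run '-' characters.

    min_run is an arbitrary illustrative threshold, not a recommended value.
    """
    pattern = re.compile(r"-{%d,}" % min_run)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(sequence)]
```

A long gap run inside an otherwise well-aligned sequence is one possible hint of the oddities discussed in this thread; what counts as "long" would have to be tuned on real alignments.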
We need to parse the multiple alignment and look for oddities such as:
- Sequences differing substantially in length from the rest. This might indicate a frameshift or an incorrectly annotated gene.
- Poorly aligned sequences that we might have to remove altogether.
- Anything more?
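The first check could be sketched roughly as follows, again in Python and purely as an illustration. The function name `length_outliers`, the `(id, aligned_sequence)` input shape, and the 30% `tolerance` are assumptions made for this example; nothing here is taken from the project. The sketch compares each sequence's ungapped length to the median of all ungapped lengths.

```python
import statistics


def length_outliers(records, tolerance=0.3):
    """Return ids whose ungapped length deviates from the median by more
    than tolerance * median.

    records:   iterable of (id, aligned_sequence) pairs (assumed shape).
    tolerance: illustrative cutoff, not a validated value.
    """
    # Ungapped length = aligned length minus gap characters.
    lengths = {seq_id: len(seq.replace("-", "")) for seq_id, seq in records}
    if not lengths:
        return []
    median = statistics.median(lengths.values())
    return [
        seq_id
        for seq_id, n in lengths.items()
        if abs(n - median) > tolerance * median
    ]
```

The median is used instead of the mean so that a single very short or very long sequence does not shift the reference length it is being compared against.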
We have to decide which output format the aligner should produce. Parsers already exist for some formats, such as XML. Maybe BioPerl has a multiple alignment parser? Don't reinvent the wheel...