UNF-PIPE / Tha-pipe

A phylogenetics pipeline

Multiple Alignment Parser #4

Closed: simfor closed this issue 12 years ago

simfor commented 12 years ago

We need to parse the multiple alignment and look for oddities such as:
- Sequences differing substantially in length from the rest. This might indicate a frameshift or an incorrectly annotated gene.
- Poorly aligned sequences that we might have to remove altogether.
- Anything more?

We have to decide which output format the aligner should produce. Parsers already exist for some formats, such as XML. Maybe BioPerl has a multiple alignment parser? Let's not reinvent the wheel...

simfor commented 12 years ago

For now, the multipleAlign subroutine returns its output as a single FASTA-formatted string. This can be read directly into a SeqIO object, as is done, for example, in the findGaps subroutine.
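To illustrate what "FASTA output in a single string" looks like once it is split into records, here is a rough sketch. It is plain Python using only the standard library, not the pipeline's actual Perl/BioPerl code, and the function name parse_fasta_string and the example sequences are made up for illustration.

```python
import re


def parse_fasta_string(fasta_text):
    """Split a multi-record FASTA string into a {header: sequence} dict.

    Hypothetical helper: it stands in for reading the string into a
    sequence reader, and is not how the pipeline itself does it.
    """
    records = {}
    # Every record starts with '>' at the beginning of a line.
    for chunk in re.split(r"^>", fasta_text, flags=re.MULTILINE):
        if not chunk.strip():
            continue  # skip the empty piece before the first '>'
        header, _, body = chunk.partition("\n")
        # Sequence lines may be wrapped; join them and drop all whitespace.
        records[header.strip()] = re.sub(r"\s+", "", body)
    return records


# Made-up aligned sequences; '-' marks an alignment gap.
example = ">seqA\nATG--CCT\nGA\n>seqB\nATGAACCTGA\n"
alignment = parse_fasta_string(example)
```

Gap characters are kept in the returned sequences, so later checks can still see where the aligner inserted them.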

We can look for all the oddities mentioned above simply by using regular expressions. This means we won't need a multiple alignment parser. Hurray!
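A minimal sketch of what such regex-based checks could look like, written in Python rather than the pipeline's Perl. Everything here is an assumption made for illustration: the function names, the 20% length tolerance, the 50% gap threshold, and the use of gap fraction as a crude stand-in for "poorly aligned".

```python
import re
from statistics import median


def ungapped_length(aligned_seq):
    # Remove gap characters to recover the length of the raw sequence.
    return len(re.sub(r"-", "", aligned_seq))


def longest_gap_run(aligned_seq):
    # Length of the longest stretch of consecutive gaps, 0 if there are none.
    return max((len(m.group()) for m in re.finditer(r"-+", aligned_seq)), default=0)


def flag_oddities(alignment, length_tolerance=0.2, max_gap_fraction=0.5):
    """Return {name: [reasons]} for sequences that look suspicious.

    Both thresholds are hypothetical defaults, not values from the pipeline.
    """
    lengths = {name: ungapped_length(seq) for name, seq in alignment.items()}
    typical = median(lengths.values())
    flagged = {}
    for name, seq in alignment.items():
        reasons = []
        # Ungapped length far from the median: possible frameshift or
        # incorrectly annotated gene.
        if typical and abs(lengths[name] - typical) / typical > length_tolerance:
            reasons.append("length")
        # Mostly gaps: crude proxy for a poorly aligned sequence.
        if seq and len(re.findall(r"-", seq)) / len(seq) > max_gap_fraction:
            reasons.append("gaps")
        if reasons:
            flagged[name] = reasons
    return flagged


# Made-up alignment: name -> aligned sequence with '-' as the gap character.
toy_alignment = {
    "a": "ATGAACCTGA",
    "b": "ATGAACCTGA",
    "c": "ATG-ACCTGA",
    "d": "A-------GA",
}
suspicious = flag_oddities(toy_alignment)
```

With this toy input only sequence "d" is flagged, for both its length and its gap content. Real data would need the thresholds tuned.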