bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

AlignIO::fasta should not flush sequences to alignment #74

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Bernd empty (Bernd empty) Original Redmine Issue: 3030, https://redmine.open-bio.org/issues/3030 Original Date: 2010-03-19 Original Assignee: Bioperl Guts


Hi,

AlignIO::fasta assumes that the FASTA input is (or should be) an alignment. We have mailed about this before. However, I find it strange that shorter sequences are automatically padded with "-". If padding is needed, the input is not actually an alignment, and $aln->is_flush should therefore return false, as it does for the MSF, Stockholm, and Clustal formats. I am not sure whether changing this code will break things; possibly an ‘alignment check’ could be offered as an option that can be enforced.

Below is the code snippet from AlignIO::fasta::next_aln that I mean:

    my $alnlen = $aln->length;
    foreach my $seq ( $aln->each_seq ) {
        if ( $seq->length < $alnlen ) {
            my ($diff) = ( $alnlen - $seq->length );
            $seq->seq( $seq->seq() . "-" x $diff );
        }
    }

The issue is that, especially with user input, a FASTA alignment might not be flush, and it should not be silently turned into a corrected alignment. It would be strange to first have to read all sequences with SeqIO::fasta and check their lengths, and only then load them into an alignment object with SimpleAlign.

I would regard it as unwanted behaviour for AlignIO::fasta to turn sequences into an alignment.
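To illustrate the concern, here is a minimal pure-Perl sketch that does not use BioPerl at all. It mimics the padding loop on plain strings; the helper names `is_flush_raw` and `pad_to_flush` are hypothetical and exist only for this example. It shows that a length check made after padding can no longer detect ragged input.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: true if all sequence strings have the same length.
sub is_flush_raw {
    my @seqs = @_;
    return 1 unless @seqs;
    my $len = length( $seqs[0] );
    for my $s (@seqs) {
        return 0 if length($s) != $len;
    }
    return 1;
}

# Hypothetical helper: end-pad shorter strings with '-' up to the longest
# one, which is what the loop quoted above does to the sequences.
sub pad_to_flush {
    my @seqs = @_;
    return () unless @seqs;
    my $alnlen = max map { length($_) } @seqs;
    my @out;
    for my $s (@seqs) {
        push @out, $s . ( '-' x ( $alnlen - length($s) ) );
    }
    return @out;
}

# Ragged input: the second and third records are shorter than the first.
my @input  = ( 'ACGT-ACGT', 'ACGTTAC', 'ACG' );
my @padded = pad_to_flush(@input);

print "before padding: ", ( is_flush_raw(@input)  ? "flush" : "not flush" ), "\n";
print "after padding:  ", ( is_flush_raw(@padded) ? "flush" : "not flush" ), "\n";
```

Once the parser has padded the records, the flush check passes regardless of what the user supplied, so it carries no information about the original input.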

From the mailing list:

On Wed, Dec 5, 2007 at 3:56 PM, aaron.j.mackey@gsk.com wrote:

Well, if you use AlignIO::fasta to read in a multi-fasta file of unaligned sequences, AlignIO::fasta makes the assumption that all of your sequences are aligned, and pads the ends of shorter sequences with gap characters (essentially, enforcing a rather silly, yet valid alignment). The fact that is_flush() then returns 1 is secondary.

If you just want to read in an array of unaligned sequences, use SeqIO::fasta instead. It doesn’t really make much sense to use AlignIO for sequences that are not aligned … conversely, if you do have aligned sequences in a multi-fasta file, then it does make sense to use AlignIO, and it also makes sense for AlignIO::fasta to end-pad sequences with gaps as necessary to get a fully valid, flush multiple sequence alignment matrix.

-Aaron

cjfields commented 8 years ago

Original Redmine Comment Author Name: Jason Stajich Original Date: 2010-03-21T20:37:59Z


So you are disagreeing with Aaron’s response on the mailing list - I’m confused about what you want to do. If they aren’t from an alignment why are you reading them with AlignIO?

A basic assumption of the AlignIO objects is that they are parsing or writing alignment data.

If you want to read in sequences use Bio::SeqIO? What part of all of this do you find strange?

cjfields commented 8 years ago

Original Redmine Comment Author Name: Mark A. Jensen Original Date: 2010-03-21T21:33:41Z


FWIW, I occasionally like to use AlignIO for unaligned sequences in order to use its random access (by_id, by_pos) methods.

MAJ

(In reply to comment #1)

cjfields commented 8 years ago

Original Redmine Comment Author Name: Bernd empty Original Date: 2010-03-22T05:32:58Z


Hi Jason,

I find it strange that AlignIO::fasta (in contrast to Clustal, Stockholm, etc.) assumes the input is aligned and, if it is not, makes it look “aligned” even though it is not. One practical problem occurs with user input (not my own ;-): when a user should supply an alignment but something is wrong with that alignment, it’s not possible to check is_flush, as it’s always true. I agree with you and Aaron that if one wants to read in a set of FASTA seqs one should use SeqIO, and for alignments AlignIO. The (my) problem is that AlignIO::fasta changes unaligned FASTA input into something that looks like an alignment but is not. Thus, I disagree with Aaron and AlignIO::fasta in this:

“AlignIO::fasta makes the assumption that all of your sequences are aligned,”
This should not be assumed; either they are aligned or they are not. If they are not, this is (in my case) due to accidentally faulty input.

“and pads the ends of shorter sequences with gap characters (essentially, enforcing a rather silly, yet valid alignment).”
It’s a silly alignment, so why enforce such a thing?

“The fact that is_flush() then returns 1 is secondary.”
I’d like to be able to check that is_flush is OK, not that it was enforced. That is how the (several) other AlignIO modules (Clustal, Stockholm, MSF) behave, so there is_flush can be used as an input sanity check.

Regards, Bernd

cjfields commented 8 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2010-03-22T09:09:26Z


(In reply to comment #3)

Okay, I see where you’re going (exception on user error). I tend to agree with both sides; the user should know something’s wrong, but it should still work when needed. Maybe automatically making the sequences flush should be an option for this format? The parser could throw/warn otherwise.
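One way to picture the "padding as an option, throw otherwise" idea is the pure-Perl sketch below. It is not BioPerl code and does not reflect any existing AlignIO::fasta parameter: the function name `flush_or_throw` and the `pad` option are assumptions made up for illustration, and the sequences are plain strings rather than sequence objects.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical sketch, not BioPerl code: takes an array ref of sequence
# strings plus an assumed 'pad' option and returns a new array ref.
sub flush_or_throw {
    my ( $seqs, %opt ) = @_;
    return [] unless @$seqs;

    my $alnlen = max map { length($_) } @$seqs;
    my @short  = grep { length( $seqs->[$_] ) < $alnlen } 0 .. $#$seqs;

    # Already flush: return an unmodified copy.
    return [@$seqs] unless @short;

    # Ragged input without the opt-in: report which records are short.
    die "Sequences are not flush; short records at index: @short\n"
        unless $opt{pad};

    # Opt-in padding: end-pad shorter records with gap characters.
    my @out;
    for my $s (@$seqs) {
        push @out, $s . ( '-' x ( $alnlen - length($s) ) );
    }
    return \@out;
}

my @ragged = ( 'ACGT-ACGT', 'ACGTTAC' );

# With the opt-in, the ragged input is padded.
my $padded = flush_or_throw( \@ragged, pad => 1 );
print join( "\n", @$padded ), "\n";

# Without it, the same input is rejected and the caller sees the error.
my $ok = eval { flush_or_throw( \@ragged ); 1 };
print "rejected: $@" unless $ok;
```

Whether strict or padding behaviour should be the default is a separate compatibility question; the sketch defaults to strict only to keep the example short.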