Open cjfields opened 8 years ago
Original Redmine Comment Author Name: Chris Fields Original Date: 2010-04-26T12:37:41Z
Bernd, this will likely be handled with the scheduled Align refactor, but it may break API so I’m pushing it to 1.7 and listing it as an enhancement.
Original Redmine Comment Author Name: Bernd empty Original Date: 2010-04-28T08:04:55Z
Sure. Just one thing to think about: the interleaved formats (e.g. clustalw, msf, stockholm, selex) also use hashes to concatenate sequences. It would be great if the readers could handle duplicate IDs too. E.g. phylip.pm uses $hash{$count}.
Original Redmine Comment Author Name: Chris Fields Original Date: 2010-04-28T08:36:45Z
Actually, I think the current sequence storage is indexed by NSE instead of simple seq_id (NSE takes into account seq_id, version, start, end, strand). For example, one can parse Rfam output via Bio::AlignIO::stockholm; Rfam contains multiple sequences with the same ID but different locations, and therefore different NSEs. From SimpleAlign::add_seq:
$name = $seq->get_nse;
if( $self->{'_seq'}->{$name} ) {
    $self->warn("Replacing one sequence [$name]\n") unless $self->verbose < 0;
}
I would consider the ability to catch possibly redundant seqs (e.g. same NSE) to be a feature, not a bug, so we would need a reasonable explanation of why this is necessary, and why the workaround you mention (i.e. modifying the seq_id, version, etc.) wouldn’t be the more appropriate solution.
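To illustrate the behaviour described above, here is a minimal sketch in plain Perl. It does not use BioPerl: `toy_nse` and `add_all` are hypothetical helpers, and the key is simplified to `id/start-end`, whereas the real `get_nse` also takes into account the other fields listed above. It only shows that records sharing an ID but differing in location get distinct keys, while a record with an identical key replaces the earlier one.

```perl
use strict;
use warnings;

# Toy stand-in for an NSE key: "id/start-end".
# Assumption: this is a simplification, not Bio::LocatableSeq::get_nse.
sub toy_nse {
    my ($seq) = @_;
    return sprintf '%s/%d-%d', $seq->{id}, $seq->{start}, $seq->{end};
}

# Store records in a hash keyed by the toy NSE, mimicking the check quoted
# above. Returns the keys whose earlier record was replaced.
sub add_all {
    my ($store, @seqs) = @_;
    my @replaced;
    for my $seq (@seqs) {
        my $name = toy_nse($seq);
        push @replaced, $name if exists $store->{$name};
        $store->{$name} = $seq;
    }
    return @replaced;
}

my %store;
my @replaced = add_all(
    \%store,
    { id => 'seqA', start => 1,   end => 50,  seq => 'ACGU' },
    { id => 'seqA', start => 100, end => 150, seq => 'GGCC' },  # same ID, new location
    { id => 'seqA', start => 1,   end => 50,  seq => 'UUUU' },  # same key: replaces the first
);

print "stored: ", join(', ', sort keys %store), "\n";
print "replaced: @replaced\n";
```

In this sketch only the third record triggers a replacement, which corresponds to the case where SimpleAlign::add_seq would emit its warning.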
(In reply to comment #2)
Sure. Just one thing to think about: the interleaved formats (e.g. clustalw, msf, stockholm, selex) also use hashes to concatenate sequences. It would be great if the readers could handle duplicate IDs too. E.g. phylip.pm uses $hash{$count}.
Author Name: Bernd empty (Bernd empty) Original Redmine Issue: 3061, https://redmine.open-bio.org/issues/3061 Original Date: 2010-04-22 Original Assignee: Bioperl Guts
Hi
Something I stumble on from time to time: the storage of sequences in AlignIO is based on SeqIDs. This complicates reading alignments with duplicate IDs, which actually occur quite a lot (e.g. CDD of NCBI). Usually I try to “uniqfy” the IDs, but this is not straightforward for all alignment formats. Actually, this is where BioPerl is really useful ;-) I’d propose to store the sequences in a hash in AlignIO using unique keys, possibly as an option, to be able to read all sequences in the alignment, even when they all have the same ID.
This would solve the replacing warnings too:
--------------------- WARNING ---------------------
MSG: Replacing one sequence [10/1-214]
Possibly this could be taken up with the AlignIO refactoring.
Regards, Bernd
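The proposal above (unique keys instead of sequence IDs) could look roughly like the following sketch in plain Perl. It is not BioPerl code: `%by_count` and `add_record` are hypothetical names, and the records are made-up examples. The running counter is only loosely modelled on the `$hash{$count}` pattern mentioned for phylip.pm elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical container: each record is stored under a running counter,
# so records that share an ID never overwrite each other.
my %by_count;
my $count = 0;

sub add_record {
    my ($id, $seq) = @_;
    $by_count{ $count++ } = { id => $id, seq => $seq };
    return;
}

add_record('10', 'MKV--LLA');
add_record('10', 'MKI--LLS');   # duplicate ID, still kept
add_record('11', 'MRVAALLA');

# Recover insertion order by sorting the numeric keys.
for my $key (sort { $a <=> $b } keys %by_count) {
    my $rec = $by_count{$key};
    print "$key\t$rec->{id}\t$rec->{seq}\n";
}
```

A plain array would give the same result here. The trade-off is that lookups by ID or NSE are no longer direct, and redundant sequences are no longer detected, which is the concern raised in the reply above.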