bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org

AlignIO hash sequence storage #78

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Bernd empty (Bernd empty)
Original Redmine Issue: 3061, https://redmine.open-bio.org/issues/3061
Original Date: 2010-04-22
Original Assignee: Bioperl Guts


Hi

Something I stumble on from time to time: the storage of sequences in AlignIO is keyed on sequence IDs. This complicates reading alignments with duplicate IDs, which actually occur quite often (e.g. CDD of NCBI). Usually I try to "uniquify" the IDs, but this is not straightforward for all alignment formats. Actually this is where BioPerl is really useful ;-) I'd propose storing the sequences in a hash in AlignIO using unique keys, possibly as an option, so that all sequences in the alignment can be read even when they all have the same ID.

This would solve the replacing warnings too:

--------------------- WARNING ---------------------
MSG: Replacing one sequence [10/1-214]
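A minimal Perl sketch of the difference between keying on the ID and keying on a generated unique key. This is not AlignIO or SimpleAlign code: the record layout and the helpers `store_by_id` and `store_by_unique_key` are hypothetical names used only to illustrate the proposal.

```perl
use strict;
use warnings;

# Hypothetical records: three aligned rows, two of which share an ID.
my @rows = (
    { id => 'seqA', seq => 'ACGT-A' },
    { id => 'seqA', seq => 'ACGTTA' },
    { id => 'seqB', seq => 'AC-TTA' },
);

# Keyed on ID: a later row with the same ID replaces the earlier one.
sub store_by_id {
    my @in = @_;
    my %store;
    $store{ $_->{id} } = $_ for @in;
    return \%store;
}

# Keyed on a running counter: every row is kept, and the ID is stored
# as ordinary data rather than being used as the hash key.
sub store_by_unique_key {
    my @in = @_;
    my %store;
    my $n = 0;
    $store{ ++$n } = $_ for @in;
    return \%store;
}

my $by_id  = store_by_id(@rows);
my $by_key = store_by_unique_key(@rows);

printf "keyed on ID:         %d rows kept\n", scalar keys %$by_id;
printf "keyed on unique key: %d rows kept\n", scalar keys %$by_key;
```

With the ID as key, the second `seqA` row silently replaces the first; with a counter as key, all three rows survive and lookups by ID would need a separate index.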

Possibly this could be taken up as part of the AlignIO refactoring.

Regards, Bernd

cjfields commented 8 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2010-04-26T12:37:41Z


Bernd, this will likely be handled with the scheduled Align refactor, but it may break the API, so I'm pushing it to 1.7 and listing it as an enhancement.

cjfields commented 8 years ago

Original Redmine Comment
Author Name: Bernd empty
Original Date: 2010-04-28T08:04:55Z


Sure. Just one thing to think about: the readers for the interleaved formats (e.g. clustalw, msf, stockholm, selex) also use hashes to concatenate sequences. It would be great if these readers could handle duplicate IDs too. E.g. phylip.pm uses $hash{$count}.
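To illustrate the point about interleaved readers, here is a small Perl sketch under stated assumptions. The input layout (one "name chunk" pair per line, blank line between blocks) and the helpers `concat_by_name` and `concat_by_position` are hypothetical and are not taken from any BioPerl parser. The position-keyed variant is only similar in spirit to the `$hash{$count}` idea mentioned above, and it assumes every block lists its rows in the same order.

```perl
use strict;
use warnings;

# Hypothetical interleaved input: two blocks of three rows each,
# where the first two rows share the ID "seqA".
my @lines = (
    'seqA ACGT',
    'seqA ACCT',
    'seqB AGGT',
    '',
    'seqA TTAA',
    'seqA TTAC',
    'seqB TTGA',
);

# Concatenate chunks using the row name as the hash key.
# Rows with the same name are merged into one overlong sequence.
sub concat_by_name {
    my @in = @_;
    my %seq;
    for my $line (@in) {
        next unless $line =~ /\S/;
        my ($name, $chunk) = split ' ', $line;
        $seq{$name} .= $chunk;
    }
    return \%seq;
}

# Concatenate chunks using the row's position within its block as the key.
# Rows with the same name stay separate; names are kept in a second hash.
sub concat_by_position {
    my @in = @_;
    my (%seq, %name);
    my $count = 0;
    for my $line (@in) {
        if ($line !~ /\S/) { $count = 0; next; }   # blank line: next block
        my ($name, $chunk) = split ' ', $line;
        $count++;
        $name{$count} //= $name;
        $seq{$count} .= $chunk;
    }
    return (\%seq, \%name);
}

my $by_name = concat_by_name(@lines);
my ($by_pos, $names) = concat_by_position(@lines);

printf "keyed on name:     %d sequences\n", scalar keys %$by_name;
printf "keyed on position: %d sequences\n", scalar keys %$by_pos;
print  "$names->{$_}\t$by_pos->{$_}\n" for sort { $a <=> $b } keys %$by_pos;
```

Keyed on name, the two `seqA` rows collapse into a single sequence twice as long as the alignment; keyed on position, each row keeps its own eight-column sequence.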

cjfields commented 8 years ago

Original Redmine Comment
Author Name: Chris Fields
Original Date: 2010-04-28T08:36:45Z


Actually, I think the current sequence storage is indexed by NSE rather than by simple seq_id (NSE takes into account seq_id, version, start, end, and strand). For example, one can parse Rfam output via Bio::AlignIO::stockholm; Rfam contains multiple sequences with the same ID but different locations, and therefore different NSEs. From SimpleAlign::add_seq:

$name = $seq->get_nse;

if( $self->{'_seq'}->{$name} ) {
    $self->warn("Replacing one sequence [$name]\n") unless $self->verbose < 0;
}

I would consider the ability to catch possibly redundant seqs (e.g. same NSE) to be a feature, not a bug, so we would need some reasonable explanation as to why this is necessary, and why the solution you suggest (i.e. modifying the seq_id, version, etc) wouldn’t be a more appropriate solution.
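As a rough illustration of this behaviour, the following Perl sketch keys a hash on a simplified name/start-end string. The helper `nse_key` is hypothetical and deliberately simpler than the real get_nse, which per the description above also takes version and strand into account; the row data are made up.

```perl
use strict;
use warnings;

# Hypothetical, simplified key builder in the spirit of name/start-end.
# It ignores version and strand, which the real NSE also considers.
sub nse_key {
    my ($row) = @_;
    return sprintf '%s/%d-%d', $row->{id}, $row->{start}, $row->{end};
}

my @rows = (
    { id => 'X1', start => 1,   end => 214 },
    { id => 'X1', start => 300, end => 513 },   # same ID, different location
    { id => 'X1', start => 1,   end => 214 },   # same ID and same location
);

my (%store, @replaced);
for my $row (@rows) {
    my $key = nse_key($row);
    # Mirror the check in add_seq: note when an existing key is overwritten.
    push @replaced, $key if exists $store{$key};
    $store{$key} = $row;
}

print "stored:   ", join(', ', sort keys %store), "\n";
print "replaced: @replaced\n";
```

Rows that share an ID but differ in location get distinct keys and are both kept; only the row whose ID and coordinates both repeat triggers a replacement, which is the redundancy check described above.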

(In reply to comment #2)

Sure. Just one thing to think about: the readers for the interleaved formats (e.g. clustalw, msf, stockholm, selex) also use hashes to concatenate sequences. It would be great if these readers could handle duplicate IDs too. E.g. phylip.pm uses $hash{$count}.