bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org
0 stars 0 forks source link

Trace data in Bio::SeqIO::scf appears scrambled #56

Open cjfields opened 9 years ago

cjfields commented 9 years ago

Author Name: Charles Tilford (Charles Tilford) Original Redmine Issue: 2858, https://redmine.open-bio.org/issues/2858 Original Date: 2009-06-18 Original Assignee: Bioperl Guts


I’m using the SCF Bio::SeqIO module to parse trace data out of chromatograms. The SCF files are being produced by phred using the “-cd” parameter. The traces come out great, and the corresponding base calls from the .phd files align with the peaks wonderfully when I visualize them on a rendered trace. However, only the A bases align to the appropriate trace channel, the rest are mixed up. I find that if I do the following re-mapping, the phred base calls match the trace:

SeqIO : Remapped A : A C : G G : T T : C

The relevant part of Bio::SeqIO::scf is here:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html\#CODE9

… which indicates that it expects the pack()ed trace data to be in order ATGC. The base call parsing code is here:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html\#CODE8

… which is unpacking in order ACGT. As far as I can tell, the relevant official SCF documentation is here:

http://staden.sourceforge.net/manual/formats\_unix\_4.html

… which indicates that both trace and base order should be ACGT (matching the SeqIO unpack() for bases, but not traces). My empirical channel unscrambling mapping implies order ACTG, which is different from either of the two orders above. The sequence from the SCF file (should be that from original AB1 file, I think) is not perfectly identical to that called by phred, but is very similar (to be expected); that is, I don’t need to remap C, G and T to get it to align with the phred data.

So it looks like the SeqIO module is not mapping the sections of the packed trace data to the appropriate bases. The unpack order is different than the staden documentation … but so is the order I impose to correct the problem. I am still unclear as to the differences between V2 and V3 of the format. The major difference appears to be coding the trace absolutely (V2) or relatively to prior values (V3); I’d expect if I was using one format and SeqIO was trying to parse the other that I would get garbage out. Running in verbose reports “scf.pm is working with a version 2 scf.”

FWIW, the trace() method returns nothing (or “0”, I think). I am skiping the API and digging the data out from the {trace} hash in the returned SeqWithQuality object. That hash is in turn keyed to {a} {c} {t} and {g}, each pointing to an array of trace values. I suppose another fix would be to get that data formally tied into the API so the trace*() methods return the content?

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2010-03-04T13:54:40Z


I am committing a fix that corrects the order. We need to evaluate the results of the parser vs true io_lib-based output (via Bio::SCF or the BioLib bindings) to ensure we’re getting the right information. I’ll see what I can come up with.