GMOD / GBrowse-Adaptors

8 stars 6 forks source link

Bio::DB::Bam::Alignment : $align->query->end < 0 !!! #4

Closed jh13-wtsi closed 9 years ago

jh13-wtsi commented 10 years ago

It is possible to create an alignment from valid BAM data such that $align->query->end < 0 . The problem is that $query->end calls the samtools function bam_cigar2qlen() to get the total length and subtracts all clipping (hard and soft) at the end of the CIGAR. This is incorrect because bam_cigar2qlen() only accounts for soft clipping, so if there is hard clipping at the end as well then the return value is too small. If there is a lot of hard clipping then the result can be negative.

jh13-wtsi commented 10 years ago

Here is an explicit example using a BAM file of PacBio reads aligned with GMAP.

$ samtools view 'ftp://ngs.sanger.ac.uk/scratch/project/jmg/MCF7_Boley/IsoSeq_MCF7.gmapdenovo.sorted.bam' chr4:0-100000 | grep 'm130604_203349_42161_c100519042550000001823081209281390_s1_p0/159470/ccs'

m130604_203349_42161_c100519042550000001823081209281390_s1_p0/159470/ccs        16      chr4    86163   0       9M3I4M2I2M2I12M1I15M1I2M1I9M3I15M4I5M3I14M10I7M1I20M2I10M2I4M3I3M1I17M3I4M1I4M1I4M1I14M3I15M1I18M2I4M2I22M1I5M3I11M1I8M3I8M1I10M1D10M1I15M1I16M2I14M1I18M1I40M1I19M2I17M1I2M1I6M2I8M1I8M1I7M1I6M3I5M1I7M1I5M1I7M1I2M2I18M1I16M1I9M1I23M1D14M1I5M2I19M1I26M1I13M4I3M348H *       0       0       AAATGTGAAAGGGAATTGGTTGGGCAAAGCCTTTAACAAGGTCCACAACAGCTGGAATGAACACTGCCAAAAAATTCATACTACTTGGAGGTCAGAAACCTACCAGACATTGCTGGTTATAAGGCAATGTGGCAAAGCTTTAGGAATGGGTCCACAAGGGCCTTGGGAATTTAACATAAGAATATTCGCAGTACTTGGAGGAGACAACCCTACAAATATGTAAAAGAATGGCGAAAAGGCCTTTAGACAGTCCAGGGCAGCCCTGGAATGAACATAAAAATATTCATTACTGGCTGCGAAAAACCCATACACATGGGTGGAAAAATGGTGGCAAAGCTTTAACCAATGCCTCAAGTCTTATTATTACACAGGAGCATTCATTTTCTGAACAAAAACTTTTACAAATGTGAAGAATGGTGGCAAAAGCTTTACTTTGTCCTCATCCCTTAATAAACATTAAGAGAATTCATACTGGAGGGAGAAACCCTACACATGCTGAAAGAATCGTTGGCAAAGGCTTTTTATTAGGTCCTTCACACCTGCTTGCTTAAGCATAAAGAGAAATTCATGACTTTGGAGGAGAACCGTACACGGTGCGAAGAATGTGGCAAAAGCTTTTAAACCAATCGTCAACTCTTATATTCACAAGAGAATCCATTTCTGGCGGCAAAAACCTTACAAATGTTGAAGAATGGGGCAAAGCCTTTACACGGGTCCACAACACTTGGAGCA    *       MD:Z:54A1G23CTA2A2G3A14C2TA2T17G15A23G9TG1C25T45T17^T81GC8G65G36A22AGA4C38C14^A20A26T30A1       NH:i:9  HI:i:1  NM:i:134        SM:i:40 XQ:i:0  X2:i:0  XO:Z:UT XT:Z:NN-NN,0.00,0.00,347..349   PG:Z:M
m130604_203349_42161_c100519042550000001823081209281390_s1_p0/159470/ccs        0       chr4    86644   0       73S31M2D13M2I20M1I11M1D16M1D14M161636N17M1I6M5I17060N61M3I13M2I4M2I9M2I13M2I8M2I17M732H *       0       0       AATCAGATAGACTTTTATTTATAAGGCTCTTTGCCAACACACACCATTGCTTAACTACAATATAATAACAAATTTCATACTGGAGAGAAACCCTACACGTGCGAGATGTGGCAAAGCCTTTTAAAACAATCCTCAACTCTTTATATTACACAGAGAATCCATTCTGGCAAAAACCTTACAAAATTGTGAAAGAAGTGGCCAAAGCCCTGATGCTGTAGGGTTTCTCCTCAGTATGAATATTCTTATGTTCATCAGGCTTGGCGGACCCCATATGAAAGCTTTTGCCACCCATTTCTTCCACATTTATGTAGGGTTTCTCCGCCCAGTATCGGAATTTTCTTATGTTCG    *       MD:Z:31^AG0A15T2C24^A16^A16A6GA1T10ATG13TC11C9T2T1A1GG1T5TTGTC5GC11G1T13C5T12C10A       NH:i:9  HI:i:1  NM:i:58 SM:i:40 XQ:i:0  X2:i:0 XO:Z:UT  XS:A:-  XT:Z:NN-NN,0.00,0.00,347..349   PG:Z:M

Note the trailing "732H" in the CIGAR of the second read.

Now let's use perl to examine the same reads.

#!/usr/bin/env perl

use strict;
use warnings;

my $bam = 'ftp://ngs.sanger.ac.uk/scratch/project/jmg/MCF7_Boley/IsoSeq_MCF7.gmapdenovo.sorted.bam';
my $qname = 'm130604_203349_42161_c100519042550000001823081209281390_s1_p0/159470/ccs';

use Bio::DB::Sam;

printf "%4d -> %4d\n", $_->query->start, $_->query->end
    for ( grep { $_->qname eq $qname; }
          Bio::DB::Sam
          ->new( '-bam' => $bam )
          ->get_features_by_location(
              '-seq_id' => 'chr4',
              '-start'  => 0,
              '-end'    => 100000,
          ));

==>

   1 ->  384
  74 -> -384

The second read has a negative query end due to the large hard clip at the end of the CIGAR!

jh13-wtsi commented 10 years ago

Closed in error! (Still getting used to this interface.)

jh13-wtsi commented 10 years ago

Also, if we look at the output of $align->padded_alignment then the first alignment looks sensible but the second does not. This may be due to the large skips in the reference in the second CIGAR string.

   1 ->  384

> AAATGTGAA---GAAT--GT--GGCAAAGCCTTT-ACAAGGTCCACAACA-CT-GAATGAACA---CAAG
> |||||||||   ||||  ||  |||||||||||| ||||||||||||||| || |||||||||   ||||
> AAATGTGAAAGGGAATTGGTTGGGCAAAGCCTTTAACAAGGTCCACAACAGCTGGAATGAACACTGCCAA

> AAAATTCATAC----TGGAG---AGAAACCCTACAAA----------TGTAAAG-AATGTGGCAAAGCCT
> |||||||||||    |||||   ||||||||||||||          ||||||| |||||||||||||||
> AAAATTCATACTACTTGGAGGTCAGAAACCTACCAGACATTGCTGGTTATAAGGCAATGTGGCAAAGCTT

> TTAGA--TGGTCCACAA--GCCT---GAA-TGAACATAAGAATATTC---ATAC-TGGA-GAGA-AACCC
> |||||  ||||||||||  ||||   ||| |||||||||||||||||   |||| |||| |||| |||||
> TAGGAATGGGTCCACAAGGGCCTTGGGAATTTAACATAAGAATATTCGCAGTACTTGGAGGAGACAACCC

> TACAAATGT---AAAGAATGTGGCAAA-GCCTTTAGACAGTCCAGG--AGCC--TGAATGAACATAAAAA
> |||||||||   ||||||||||||||| ||||||||||||||||||  ||||  ||||||||||||||||
> TACAAATATGTAAAAGAATGGCGAAAAGGCCTTTAGACAGTCCAGGGCAGCCCTGGAATGAACATAAAAA

> TATTCA-TACTG---GCGAAAAACCC-TACACATG---TGAAAAAT-GTGGCAAAGCTTTTAACCAAT-C
> |||||| |||||   ||||||||||| ||||||||   |||||||| |||||||||| |||||||||| |
> TATTCATTACTGGCTGCGAAAAACCCATACACATGGGTGGAAAAATGGTGGCAAAGC-TTTAACCAATGC

> CTCAAGTCTTATTA-TACACAGGAGCATTCA--TTCTGAACAAAAAC-TTTACAAATGTGAAGAAT-GTG
> |||||||||||||| ||||||||||||||||  |||||||||||||| |||||||||||||||||| |||
> CTCAAGTCTTATTATTACACAGGAGCATTCATTTTCTGAACAAAAACTTTTACAAATGTGAAGAATGGTG

> GCAAAGCCTTTACTTGGTCCTCATCCCTTAATAAACA-TAAGAGAATTCATACTGGA--GAGAAACCCTA
> ||||||||||||||||||||||||||||||||||||| |||||||||||||||||||  |||||||||||
> GCAAAAGCTTTACTTTGTCCTCATCCCTTAATAAACATTAAGAGAATTCATACTGGAGGGAGAAACCCTA

> CACATG-TG-AAGAAT--GTGGCAAA-GCTTTTTA-TAGGTCC-TCACAC---CTTGC-TAAACAT-AAG
> |||||| || ||||||  |||||||| |||||||| ||||||| ||||||   ||||| ||||||| |||
> CACATGCTGAAAGAATCGTTGGCAAAGGCTTTTTATTAGGTCCTTCACACCTGCTTGCTTAAGCATAAAG

> AG-AATTCAT-AC--TGGAGAGAAACCCTACAC-GTGCGAAGAATGTGGC-AAAGCTTTT-AACCAATCC
> || ||||||| ||  |||||||||||||||||| |||||||||||||||| ||||||||| |||||||||
> AGAAATTCATGACTTTGGAGGAGAACCGTACACGGTGCGAAGAATGTGGCAAAAGCTTTTAAACCAATCG

> TCAACTCTTATATTACACAAGAGAATCCA-TTCTG--GACAAAAACCTTACAAATG-TGAAGAATGTGGC
> |||||||||||||| |||||||||||||| |||||  ||||||||||||||||||| |||||||||||||
> TCAACTCTTATATT-CACAAGAGAATCCATTTCTGGCGGCAAAAACCTTACAAATGTTGAAGAATGGGGC

> AAAGCCTTTACAC-GGTCCACAACACT----GAA
> ||||||||||||| |||||||||||||    |||
> AAAGCCTTTACACGGGTCCACAACACTTGGAGCA

  74 -> -384

> ----------------------------------------------------------------------
>                                                                       
> AATCAGATAGACTTTTATTTATAAGGCTCTTTGCCAACACACACCATTGCTTAACTACAATATAATAACA

> ---TTCATACTGGAGAGAAACCCTACACGTGCGAAGAATGTGGCAAAGC--TTTTAACCAATCCTCAACT
>    |||||||||||||||||||||||||||||||  |||||||||||||  |||||||||||||||||||
> AATTTCATACTGGAGAGAAACCCTACACGTGCGA--GATGTGGCAAAGCCTTTTAAAACAATCCTCAACT

> C-TTATATTACACAAGAGAATCCATTCTGGACAAAAACCTTACAAAAATGTGAAGAATGTGGCAAAGCAT
> | ||||||||||| |||||||||||||||| ||||||||||||||                         
> CTTTATATTACAC-AGAGAATCCATTCTGG-CAAAAACCTTACAA-------------------------

> GTGTAGGGTTTCTCTCCAGTATGAATACTCTTATGTTTATTAAGGGTTGCGGATTGTCTAAAGGCTTTGC
>                                                                       
> ----------------------------------------------------------------------

> CACATTGTTCACATTTGTAGGGCTTCTCTCCAGTATGAATTCTCTTATGTTCA-----------------
>                                                                       
> ----------------------------------------------------------------------

> --
>                                                                       
> ----------------------------------------------------------------------
lstein commented 10 years ago

This ought to fix the problem you describe. Could you give it a whirl?

Lincoln

diff --git a/Bio-SamTools/lib/Bio/DB/Bam/Query.pm b/Bio-SamTools/lib/Bio/DB/Bam/Query.pm
index 235b0b2..2aa9f76 100644
--- a/Bio-SamTools/lib/Bio/DB/Bam/Query.pm
+++ b/Bio-SamTools/lib/Bio/DB/Bam/Query.pm
@@ -36,7 +36,8 @@ use Bio::DB::Sam::Constants qw(CIGAR_SYMBOLS BAM_CREF_SKIP BAM_CSOFT_CLIP BAM_CH

 use constant CIGAR_SKIP      => {CIGAR_SYMBOLS->[BAM_CREF_SKIP]  => 1,
                 CIGAR_SYMBOLS->[BAM_CSOFT_CLIP] => 1,
-                CIGAR_SYMBOLS->[BAM_CHARD_CLIP] => 1};
+#               CIGAR_SYMBOLS->[BAM_CHARD_CLIP] => 1
+};

 sub new {
jh13-wtsi commented 10 years ago

Lincoln Stein wrote:

This ought to fix the problem you describe.

diff --git a/Bio-SamTools/lib/Bio/DB/Bam/Query.pm b/Bio-SamTools/lib/Bio/DB/Bam/Query.pm index 235b0b2..2aa9f76 100644 --- a/Bio-SamTools/lib/Bio/DB/Bam/Query.pm +++ b/Bio-SamTools/lib/Bio/DB/Bam/Query.pm @@ -36,7 +36,8 @@ use Bio::DB::Sam::Constants qw(CIGAR_SYMBOLS BAM_CREF_SKIP BAM_CSOFT_CLIP BAM_CH

use constant CIGAR_SKIP => {CIGAR_SYMBOLS->[BAM_CREF_SKIP] => 1, CIGAR_SYMBOLS->[BAM_CSOFT_CLIP] => 1,

  •            CIGAR_SYMBOLS->[BAM_CHARD_CLIP] => 1};

    +# CIGAR_SYMBOLS->[BAM_CHARD_CLIP] => 1 +};

    sub new {

This doesn't work if either end of the CIGAR has both soft and hard clipping. The SAM 1.4 specification says "S may only have H operations between them and the ends of the CIGAR string.". That wording implies it is valid to have both S and H at the end.

Jeremy Henty

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

keiranmraine commented 10 years ago

I don't know if this is helpful but...

5H2S50M2S5H is a valid cigar. H will never contribute to the position in the current record as the sequence doesn't exist.

Under BWA-mem mappings it is common to get

I had originally (on an internal Sanger email) suggested the following:

https://github.com/GMOD/GBrowse-Adaptors/blob/master/Bio-SamTools/lib/Bio/DB/Bam/Query.pm#L117-L140

I believe that the solution is to add:

next if CIGAR_SYMBOLS->[BAM_CHARD_CLIP] eq $c->[0];

between 121-122 and 135,136.

(and remove BAM_CHARD_CLIP from the CIGAR_SKIP constant, provided it's not referenced outside of the module)

This will also have a knock on effect on the length and subset functions within this object (impact on others unknown).

However this only corrects the position values not the alignments.

lstein commented 10 years ago

Ok, I will need to set aside a chunk of time to work on this. It requests some thought.

Lincoln

On Thu, Sep 25, 2014 at 9:43 AM, Keiran Raine notifications@github.com wrote:

I don't know if this is helpful but...

5H2S50M2S5H is a valid cigar. H will never contribute to the position in the current record as the sequence doesn't exist.

Under BWA-mem mappings it is common to get

  • read 1
    • 100M
    • read 2
    • 50M50S
    • read 2 (supplimentary)
    • 50H45M5S

I had originally (on an internal Sanger email) suggested the following:

https://github.com/GMOD/GBrowse-Adaptors/blob/master/Bio-SamTools/lib/Bio/DB/Bam/Query.pm#L117-L140

I believe that the solution is to add:

next if CIGAR_SYMBOLS->[BAM_CHARD_CLIP] eq $c->[0];

between 121-122 and 135,136.

(and remove BAM_CHARD_CLIP from the CIGAR_SKIP constant, provided it's not referenced outside of the module)

This will also have a knock on effect on the length and subset functions within this object (impact on others unknown).

However this only corrects the position values not the alignments.

— Reply to this email directly or view it on GitHub https://github.com/GMOD/GBrowse-Adaptors/issues/4#issuecomment-56819939.

Lincoln Stein Director, Informatics and Bio-computing Program and Senior Principal Investigator Professor, Department of Molecular Genetics, University of Toronto

Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 510 Toronto, Ontario Canada M5G 0A3

Tel: 416-673-8514 Mobile: 416-817-8240 Email: lincoln.stein@gmail.com Toll-free: 1-866-678-6427 Twitter: @OICR_news

Executive Assistant Renata Musa Tel: 647-260-7956 Email: renata.musa@oicr.on.ca www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

keiranmraine commented 9 years ago

@lstein I've built a test file with a read that should aid testing of this. I'm happy to spend some time fixing it unless you've already done most of it. Let me know.

Keiran

lstein commented 9 years ago

Hi Keiran, I did do some work on this. Have you checked to see if it fixes the problem? Otherwise would appreciate getting the test file.