HongzheGuo / deBGA

de Bruijn Graph-based read aligner
33 stars 8 forks source link

Malformed header, CIGAR and strange XA field #2

Open ulah opened 7 years ago

ulah commented 7 years ago

Hi,

I have some problems in understanding the output sam and found some bugs in the CIGAR generation, too.

My command line for the alignment is: deBGA aln -k 22 -f 30 -u 2000 -x 50 -l 512 -p 16 $idxPath $fRead $rRead $samOut

1) The header only contains SQ fields. There is no HD field and no possibility to add a RG field which is a prerequisite for most variant callers. Could you provide an update to provide write these directly to the header before mapping. This would make it possible to pipe the output directly into samtools to create sorted bam files. Otherwise I would need to create huge sam files (450Gb for my whole genome data), modify the header accordingly and then convert it to a bam which takes quite some CPU time...

2) sam to bam conversion fails due to malformed CIGAR strings: $ samtools view -b ./testMapping.sam > ./testMapping.bam [E::sam_parse1] CIGAR and query sequence are of different length [W::sam_read1] parse error at line 562 [main_samview] truncated file.

Looking at line 562 you can see that the CIGAR contains at the end 2 M-blocks (101M99M). The correct one (at least concerning the sequence length) would be 99M.

$ sed -n '562,562'p testMapping.sam ST-E00192:25:H25VHALXX:3:1101:21959:1801 161 chr1 121484047 0 13S1M2I8M2I1M1I101M99M = 121484046 126 TTGGTTTGAAACGGGATTTGTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAAGGATCCTTTACAGAGAGCAGGCTTGAAACACTCTT A,,A,,,,AFKKFFK<,AA,7A7AKKKKKAF,FK7FFKK,FKAFFK,<,,AFAKKFA,<FAAKKKKAFAFFFFKKA,7FKKK7FKKKFFAA7,F7<77FFKFAKFK,,<FK<,A7<,,AFKKA77<F XA:Z:chr19,+27732030,127M,7;chr19,+27736445,7S120M,6;chr19,+27737460,3S99M2I5M1I3M14S,3;chr10,-42529964,98M1D2M1I6M20S,7;chr10,-42529291,7S120M,7;chr7,+61969354,28S9M2D90M,7;chr7,+61971051,28S15M1D1M1D83M,7;chr9,+66800813,27S5M2I1M2I82M8S,7;chr9,+69990560,27S6M2I1M2I82M7S,7;chr10,+42404635,12S115M,7;chr4,-68266614,14S2I83M3S,1;chr4,-68266605,7S91M1D2M1I26M,7;chr3,-90464110,14S3M1I6M2I39M1I2M2I27M1I1M28S,7;chr5,-49517155,22S3I1M1I5M4I30S,7;chr6,-61914242,14S2I105M6S,1;chr10,-42538804,14S2I105M6S,1;chr10,-42541862,14S2I105M6S,1;chr5,-49517142,19S5M1D2M1D71M30S,7;chr10,-42538794,6S118M3S,7;chr11,-48943958,14S82M31S,0;chr11,-50718571,21S3M2I40M1D3M1I22M35S,2;chr11,-50755957,23S2I1M1I5M4I35S,6;chr11,-55019377,17S79M31S,0;chr15,-20001962,14S69M2I1M1I3M37S,3;chr10,-42541852,6S118M3S,7;chr19_gl000208_random,-11322,22S3I1M1I5M4I27S,9;chr19_gl000208_random,-48103,14S2I81M28S,2;chr10,-42544910,6S121M,7;chr11,-48935955,96M31S,7;chr11,-48943953,8S5M1I82M31S,7;chr11,-50718565,14S3M1I6M2I40M1D5M1I24M31S,7;chr11,-50755947,18S6M2I3M1I4M1D62M31S,7;chr11,-55019377,14S3M1I7M1I3M1I66M24S,7;chr15,-20001948,83M2I1M1I3M37S,7;chr19,-27888758,14S3M1I6M2I41M29S,7;chr19_gl000208_random,-11306,14S82M2I1M28S,7;chr19_gl000208_random,-48087,96M2I1M28S,7;chr7,+61967157,34S1M3I1M3I52M1I3M1D29M,7;chr7,+61973761,19S75M1I9M1D23M,7;chr7,+61974441,19S75M1I9M1D23M,7;chr7,+61975130,28S66M1I10M1D22M,7;chr7,+61967157,34S1M3I1M3I85M,13;chr7,+61973770,28S66M1I9M1D8M15S,0;chr19,+27732370,127M,8;chr19,+27732030,127M,12; NM:i:7

3) If I consider the alignment above: Why do I get so many XA:Z fields? They look like alternative alignments, but in the case, why isn't the first alternative the best alignment (chr19,+27732030,127M,7)? What options can I change to suppress such a large number of XA:Z outputs?

Any help is highly appreciated! Best regards, Urs

HongzheGuo commented 7 years ago

Really sorry for my late reply and I have reply your email, please check it