amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

Truncated output #121

Closed ParkvilleData closed 5 years ago

ParkvilleData commented 5 years ago

Hi,

I am aligning a paired end fastq sample to a small reference and the sam output file looks fairly good. The issue is when converting from SAM to BAM using samtools I get the following error.

[bobbieshaban@spartan-snowy034 snap_extraction]$ samtools view -bS new.sam > new.bam
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 9462
[main_samview] truncated file.

I checked out line 9462 (and performed a google search) and it seems that issue is related to reads aligning "off the edge" of the reference.. Well, that's partly it, it's mainly that the CIGAR and query sequence are different because of that.

I have tried running CleanSam (picard) but it doesn't seem to clean up the sam file.

Anyone know how to fix this? Of course I could just remove those lines from the sam file, there are ~1000 in a 1,000,000 read sample.

Thankyou for your help, Bobbie

bolosky commented 5 years ago

It’s probably a bug.

I’ve gotten a private report of a similar problem that I haven’t had a chance to look at yet. I’ll check them both out soon.

Can you send me an example read that caused this problem (and if it’s aligned against something other than a standard reference, where I can get the FASTA for at least the contig that it aligned to)? I’d like to repro it.

BTW, you don’t need to generate SAM output and then run it through samtools to make BAM. SNAP will write BAM natively. Just give it an output filename that ends in .bam, and you get BAM. It’s almost certainly faster than using samtools. SNAP will also sort and mark duplicates if you want, too. There’s a sort flag (and a separate flag to turn off duplicate marking if you’re sorting and don’t want it). Again, it’s almost certainly faster than using external tools.

--Bill

From: Parkville Data Workflow User Group notifications@github.com Sent: Monday, April 15, 2019 10:09 PM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] Truncated output (#121)

Hi,

I am aligning a paired end fastq sample to a small reference and the sam output file looks fairly good. The issue is when converting from SAM to BAM using samtools I get the following error.

[bobbieshaban@spartan-snowy034 snap_extraction]$ samtools view -bS new.sam > new.bam

[E::sam_parse1] CIGAR and query sequence are of different length

[W::sam_read1] Parse error at line 9462

[main_samview] truncated file.

I checked out line 9462 (and performed a google search) and it seems that issue is related to reads aligning "off the edge" of the reference.. Well, that's partly it, it's mainly that the CIGAR and query sequence are different because of that.

I have tried running CleanSam (picard) but it doesn't seem to clean up the sam file.

Anyone know how to fix this? Of course I could just remove those lines from the sam file, there are ~1000 in a 1,000,000 read sample.

Thankyou for your help, Bobbie

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F121&data=02%7C01%7Cbolosky%40microsoft.com%7Cc228923b7a2d487f358108d6c22995c0%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636909881201197253&sdata=LNBuPdTy9nfDG3Wf0GMvGRKrO%2BCTpz9km658xqTFKkY%3D&reserved=0, or mute the threadhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752beGZI_AcZq5VJj59FB-uzqS7pn_ks5vhVrVgaJpZM4cxXto&data=02%7C01%7Cbolosky%40microsoft.com%7Cc228923b7a2d487f358108d6c22995c0%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636909881201207246&sdata=UD%2BjgJnBHVux7a7WpvN9m22VzxNNtqHd3MpNwQefiKA%3D&reserved=0.

ParkvilleData commented 5 years ago

Hi, thanks for the reply! Sure, I can do that.

CP017623.1_4203 147     CP017623.1_Candida_albicans_SC5314_chromosome_1_sequence        2747059 60      85=1X40=        =       2745950 -1235   AGAAGACAACTTACTCCTTATGTGCCTTCTCATCATTAAATGATGAAAATAAACAAACAAACAAATAAACAAACAAATAAATAAA   BF/FFFF/FFBFBFFFF/FFFFFFFFFFFBFFF/FFF/FFFFFFFF<BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBFFFFF<FFFFFFFFFFFFFF/FFBFF/FFFBFFFFBB/BB  PG:Z:SNAP       RG:Z:FASTQ      NM:i:1

The link to the files can be found here: https://www.dropbox.com/s/xqwnpywzxl8v1qp/snap_repro.tar.gz

Would you know why the number of aligned reads is different in the snap output when compared to samtools flagstat?

writing sorted reads... 999990 reads in 32 blocks, 0 s (0 s read wait, 0 s write wait)
1    250 15     25      7      100.00%  97.35%  2.19%   0.46%   -       251693

My initial thinking was of the error mentioned above. This set is a mock data set created with insilicoseq and I've had about 96% alignment success with bowtie2 so I was hoping to get similar with snap. The snap output shows only %0.46 not found (which I am very happy with) but when running a check with samtools flagstat only 622,099 reads are aligned.

[bobbieshaban@spartan-snowy034 snap_alignment_update]$ samtools flagstat hiseq_reads_paired_snap_full.fasta.singles.alignment.bam
622099 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
622085 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks for your help! Bobbie

bolosky commented 5 years ago

The attachments didn’t make it through the github forwarder. Could you please send them to me directly at bolosky@microsoft.commailto:bolosky@microsoft.com?

The output you show from SNAP looks odd. Here’s an example of what it should look like for a single-end run:

d:\temp>snap single \gdc\indices\hg38-20 \gdc\downloaded_files\0138500c-2293-431b-988a-93fcd2f04d4f\C828.TCGA-D3-A2JB-06A-11D-A196-08.2_gdc_realn.bam -map -pre -so -o delete-me.bam Welcome to SNAP version 1.0dev.100.

Loading index from directory... 33s. 3209514105 bases, seed size 20 Aligning. sorting...sorted 82681446 reads in 102 blocks, 244 s read wait align 1.193 s + merge 0.015 s, read release align 0.000 s + merge 6.378 s write wait 87.856 s align + 0.193 s merge, write filter 27.155 s align + 186.422 s merge Total Reads Aligned, MAPQ >= 10 Aligned, MAPQ < 10 Unaligned Too Short/Too Many Ns Reads/s Time in Aligner (s) 82,681,446 68,655,107 (83.04%) 11,660,264 (14.10%) 170,459 (0.21%) 2,195,616 (2.66%) 496,492 167

In this case, it failed to align 0.21% of the reads, and didn’t try on 2.66% because they were too short or had too many Ns in them. 14.10% of the reads aligned with a low MAPQ and so are questionable. The sum of reads in columns 2-5 should equal that in column 1, and should equal the number in the output, assuming you didn’t ask for filtering or multiple alignments (which you get using other flags to snap).

--Bill

From: Parkville Data Workflow User Group notifications@github.com Sent: Tuesday, April 16, 2019 5:40 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Truncated output (#121)

Hi, thanks for the reply! Sure, I can do that.

CP017623.1_4203 147 CP017623.1_Candida_albicans_SC5314_chromosome_1_sequence 2747059 60 85=1X40= = 2745950 -1235 AGAAGACAACTTACTCCTTATGTGCCTTCTCATCATTAAATGATGAAAATAAACAAACAAACAAATAAACAAACAAATAAATAAA BF/FFFF/FFBFBFFFF/FFFFFFFFFFFBFFF/FFF/FFFFFFFF<BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBFFFFF<FFFFFFFFFFFFFF/FFBFF/FFFBFFFFBB/BB PG:Z:SNAP RG:Z:FASTQ NM:i:1

I've attached the reference, SRS121011.fasta and the set of paired end reads.

Would you know why the number of aligned reads is different in the snap output when compared to samtools flagstat?

writing sorted reads... 999990 reads in 32 blocks, 0 s (0 s read wait, 0 s write wait)

1 250 15 25 7 100.00% 97.35% 2.19% 0.46% - 251693

My initial thinking was of the error mentioned above. This set is a mock data set created with insilicoseq and I've had about 96% alignment success with bowtie2 so I was hoping to get similar with snap. The snap output shows only %0.46 not found (which I am very happy with) but when running a check with samtools flagstat only 622,099 reads are aligned.

[bobbieshaban@spartan-snowy034 snap_alignment_update]$ samtools flagstat hiseq_reads_paired_snap_full.fasta.singles.alignment.bam

622099 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

622085 + 0 mapped (100.00% : N/A)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (N/A : N/A)

0 + 0 with itself and mate mapped

0 + 0 singletons (N/A : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks for your help! Bobbie

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F121%23issuecomment-483893659&data=02%7C01%7Cbolosky%40microsoft.com%7C571ab557606d46e95eff08d6c2cd2f22%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636910583847891048&sdata=6GidmAvC%2FRL9IpyJjZ%2FpR6I3eYu3Po2AoLPn0B7IUPA%3D&reserved=0, or mute the threadhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752U34DyStCUUdp9EAhxjVHuG7Q-rdks5vhm1OgaJpZM4cxXto&data=02%7C01%7Cbolosky%40microsoft.com%7C571ab557606d46e95eff08d6c2cd2f22%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636910583847891048&sdata=OImnpOayeaAU4wY2H4%2FnDqTpuiZAYbF4mlYfaTYsouY%3D&reserved=0.

bolosky commented 5 years ago

This is fixed in dev.101.

The problem was that one of the reads contained a K base in a read that was aligned as a reverse complement. SNAP didn't know how to reverse complement a K, so it just turned it into a binary 0, which in C++ is the string terminator; this truncated the sequence in the output SAM file.

I fixed it to know the complement of K (and the other weird bases), though the actual aligner will treat these as Ns. I also fixed it so that if it sees a base that's not on the list of meaningful bases in the Wikipedia FASTA page that it will just turn into an N, so this should never happen again, even if there are new bases defined that I don't know about now.

I don't have access to the snap website at Berkeley to push a new binary, so you'll have to build this from the GitHub sources in the dev branch (git checkout dev followed by make).