amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
288 stars 66 forks source link

Misalignment in RNA-seq data #5

Closed rnpandya closed 12 years ago

rnpandya commented 12 years ago

From Andrew Magis andrewmagis@gmail.com

Hello,

I've been aligning RNA seq reads to the GRCh37 transcriptome and I've come across a very strange alignment error. SNAP is reporting an alignment in the output file, even though the statistics output show that nothing should be aligned. When I check the alignment that SNAP reports against the target sequence, the alignment is incorrect. Oddly enough, this only occurs when two specific sequences are included in the index file, but does not occur when either of the sequences are in the index alone. I have attached all files discussed below so you may replicate the error. I am using the latest version of SNAP (0.13.2).

PRUNE2_SEC61B_1.fa contains two transcript sequences (targets). I built this into a SNAP index using the following command:

$ snap index PRUNE2_SEC61B_1.fa prune2_sec61b_1

err2.fastq contains the query read (100nt). I queried the SNAP index using the following command:

$ snap single prune2_sec61b_1 err2.fastq -o prune2_sec61b_1.sam

SNAP reports the following in its output:

Welcome to SNAP version 0.13.2.

Loading index from directory... 0s. 2624 bases, seed size 20 ConfDif MaxHits MaxDist MaxSeed ConfAd %Used %Unique %Multi %!Found %Error Reads/s 2 250 8 25 4 100.00% 0.00% 0.00% 100.00% - 11

Note that %!Found is 100%, indicating no alignments were found. However, when I check the output file (prune2_sec61b_1.sam), I see the following:

HWI-ST1155:101:C0N08ACXX:1:2208:12842:121468_2:N:0:CGATGT 0 |protein_coding|9|79251651|79320397|-|ENST00000223609|ENSG00000106772|PRUNE2 2067 0 1I1=1I1=1X1=1D2X2=1D1X1=1X87= * 0 0 GCCAGCTGCAGGTCTTTCGGGGGCTCCGTAACTTTCTATCCGTCCGCGTCAGCGCCTTGCCACCCTCATCTCCAATATGCCTGGTCCGACCCCCAGTGGC CCCFFFFFHHHHFIJIJJJJJJGIEIJJHHIJJJJJIJJJHHHHFFDDDDDDDDDDDDDCDCCDDDDDDDDDDDDDDDEDCACBCDDDDDDDDD@BCCCB

This indicates that an alignment was made to the first sequence (ENST00000223609) at position 2067. This is odd, because the length of that sequence is only 2079, and even the overlapping nucleotides do not match.

Even stranger, this error is dependent on the presence of the second sequence in the database file (ENST00000223641). The query read is complementary to the first 87 nucleotides of this sequence. When I remove this second read from the database, and again create the index from the resulting file (PRUNE2.fa), again it shows that there are no alignments in the statistics:

$ snap index PRUNE2.fa prune2

$ snap single prune2 err2.fastq -o prune2.sam

Welcome to SNAP version 0.13.2.

Loading index from directory... 0s. 2079 bases, seed size 20 ConfDif MaxHits MaxDist MaxSeed ConfAd %Used %Unique %Multi %!Found %Error Reads/s 2 250 8 25 4 100.00% 0.00% 0.00% 100.00% - 13

And when I check the output file (prune2.sam), I see the query sequence was (correctly) not aligned to anything:

HWI-ST1155:101:C0N08ACXX:1:2208:12842:121468_2:N:0:CGATGT 4 * 0 0 * * 0 0 GCCAGCTGCAGGTCTTTCGGGGGCTCCGTAACTTTCTATCCGTCCGCGTCAGCGCCTTGCCACCCTCATCTCCAATATGCCTGGTCCGACCCCCAGTGGC CCCFFFFFHHHHFIJIJJJJJJGIEIJJHHIJJJJJIJJJHHHHFFDDDDDDDDDDDDDCDCCDDDDDDDDDDDDDDDEDCACBCDDDDDDDDD@BCCCB

I have repeated the index also calculating table bias (-c), using different seed sizes, etc. The error still occurs. However, it does require the distance (-d) to be at least 8, anything less and the bad alignment does not occur.

An interesting note: the query read is actually partially complementary to the second target sequence (the first 87 nucleotides). In the bad alignment, SNAP reports the following CIGAR operator:

1I1=1I1=1X1=1D2X2=1D1X1=1X87=

which shows complementarity of 87 nucleotides following a series of insertions, deletions and mismatches. It almost looks like SNAP found an alignment for the second sequence, but mistakenly printed that output to the first sequence.

The vast majority of the alignments are correct, when I run SNAP against my entire RNA-seq dataset and transcriptome. However, several are misaligned like this one, though I have not done a comprehensive search for them all. At least 6 are misaligned near this position in the gene PRUNE2, and I have seen several others in other genes as well.

Thanks for your help,

Andrew

bolosky commented 12 years ago

There are two different things going on here. First is that internally SNAP represents the reference genome as a single string, and due to some unimplemented code was allowing reads to be mapped across a chromosome boundary. I fixed that, although there's still an issue with the case where a read would map to the beginning of a chromosome with an insertion, or the end with a deletion. Currently, SNAP won't map these reads.

The second problem was that the SAM writer more-or-less ignored the aliger's return status, and so would generate a mapping for a read that the aligner said was "NotFound' if the aligner happened to put in a genome offset. This happened when there were seeds that hit, but that had too large of an edit distance to qualify as good.

Both of these will be fixed in 0.13.3, which will be checked in once it finishes a test pass.

bolosky commented 12 years ago

I just pushed 0.13.3, which fixes both of these problems.