COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
778 stars 165 forks source link

Locate reads for transcriptional variants #528

Open davidhbrann opened 4 years ago

davidhbrann commented 4 years ago

Thank you for this helpful software.

I'm looking to look at the raw fastq reads for transcripts that came up as differentially expressed after quantification with Salmon. Is there a good way to link the original reads back with the estimated counts from Salmon for each transcriptional variant of a particular gene? Ideally I'd want to find all the reads that were thought to be most-likely aligned to each variant (or the likelihood that each read was aligned to each variant).

For instance, I ran salmon as follows:

salmon quant -i $S_IDX -l A \
-1 $IN_FQ1 -2 $IN_FQ2 \
-p $CORES \
--validateMappings \
--numGibbsSamples 20 \
--gcBias \
--no-version-check \
-o $SALMON_FOLD \
--writeMappings | samtools view -Sb - > $PREFIX"salmon_pseudobam.bam"

However, I'm afraid I don't understand the format of the resulting peudosam file. Is it described anywhere in the documentation? I don't see anything here, for instance. What do the flags and qualities represent?

More importantly, is there a way to filter the pseudobam files to find the reads corresponding to the counts/NumReads in the quant.sf output file? Do the normal samtools quality and flag filters work to subset e.g. uniquely-mapped reads, or do those concepts not really apply to these pseudobams?

shangguandong1996 commented 4 years ago

I am also confused about it. It seems that peudosam can not be converted into bam because of lacking of location.

If I just use the

samtools sort -O bam -@ 30 -o sort.bam Mapping.sam 

samtools index sort.bam

I can not load the sort.bam into IGV.

But I did find the two issue: #475 and #38 , which mentioned bam file.

rob-p commented 4 years ago

Hi @dhb2128 and @shangguandong1996,

To answer your first question @dhb2128:

Is there a good way to link the original reads back with the estimated counts from Salmon for each transcriptional variant of a particular gene? Ideally I'd want to find all the reads that were thought to be most-likely aligned to each variant (or the likelihood that each read was aligned to each variant).

There is not currently a "trivial" way to do this. The alignments in the SAM file represent all of the potential mapping loci that salmon considers for each read. The alignment score tag AS tells you the quality of the sequence-level match, and so alignments with a higher AS score are more likely to be valid loci for a read than those with lower scores. However, there will often be multiple alignments for a read with the same AS score. In that case, it is possible to compute the (posterior) probability for a read having arisen from a specific transcript if you have the salmon abundances in hand, but that requires re-parsing and analyzing the files together.

To answer your other questions:

However, I'm afraid I don't understand the format of the resulting peudosam file. Is it described anywhere in the documentation? I don't see anything here, for instance. What do the flags and qualities represent?

It is just a SAM file without CIGAR strings. The flags have the same (normal) interpretation for SAM records. However the CIGAR strings are not meaningful (apart from what is required for the file to undergo valid processing with samtools). The records additionally contain tags about the number of targets to which a fragment multi-maps, and the alignment score of the read pair to the current target (in the AS flag). The records themself are just normal SAM records, but with a trivial CIGAR strong.

More importantly, is there a way to filter the pseudobam files to find the reads corresponding to the counts/NumReads in the quant.sf output file? Do the normal samtools quality and flag filters work to subset e.g. uniquely-mapped reads, or do those concepts not really apply to these pseudobams?

There is no easy way to filter so the above condition is satisfied, as the NumReads are obtained by proportional allocation of the reads according to the underlying probabilistic model of salmon. Specifically, the NumReads column of the quantification file corresponds to summing over the expectation of all of the latent variables that represent fragment to transcript assignment so that, apart from uniquely-mapped reads, there is no way to say that a fragment definitely came from a transcript.

However, you should still be able to easily filter out uniquely mapped reads, and you can interpret them in a relatively unambiguous way. Also, you could filter on the AS tag as well. For a given read, if there is a single transcript where the AS value is much larger than the others for this read, it is overwhelmingly likely that the read originated from the transcript with the unique best AS score.

@shangguandong1996 : The SAM file does contain positions (and orientations, and alignment scores) for each read. It is simply that the positions are with respect to the transcriptome and not with respect to the genome.

davidhbrann commented 4 years ago

Thanks for the response.

The transcriptional variants I've been interested often are quite similar (e.g. only differ for a small part of one exon). Therefore, many of the reads (especially when they map to parts of the genes that don't differ) show up as pseudoaligned to multiple variants, as you'd expect. In that case, do you suggest only looking at the uniquely mapped reads, or only looking at primary alignments for each read, or still looking at all reads (perhaps with a certain AS score) for a given transcript? I'm mostly interested in performing sanity checks that transcriptional variants identified by Salmon/Swish are differentially used across conditions. Or would it be better to use a tool like DEXSeq to asks these questions directly?

Also, when filtering by the AS, I found some reads with AS:i:-2147483648, which I assume is an overflow error.

rob-p commented 4 years ago

The AS:i:-2147483648 is a sentinel value basically meaning the alignment was below the minimum acceptable quality. You can simply ignore those (its the min signed 32-bit integer).

Let me think about your other question (and ping @mikelove), and get back to you.

shangguandong1996 commented 4 years ago

@rob-p Thanks for your explantion, I understand :)