COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

low mapping rate ? #160

Open atasub opened 7 years ago

atasub commented 7 years ago

I recently ran Salmon in quasi-mapping-based mode, and when I checked the salmon_quant.log file I saw that the mapping rate was around 65-68% for all of the samples. Do you have any suggestions for improving the mapping rate? I used "--libType A" to infer the library type and got a warning that "Greater than 5% of the fragments disagreed with the provided library type", but I guess this is not an issue. This is an example of one of the "lib_format_counts.json" files:

{
    "read_files": "( /mnt/dznehomes/homes/simonj/RNAseq_pipeline/frontal_data/samples/Trimmed_FASTQ_files/00116_GFM_R1_trimmed.fastq.gz, /mnt/dznehomes/homes/simonj/RNAseq_pipeline/frontal_data/samples/Trimmed_FASTQ_files/00116_GFM_R2_trimmed.fastq.gz )",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 0.9241470144855659,
    "num_compatible_fragments": 34584460,
    "num_assigned_fragments": 37423115,
    "num_consistent_mappings": 334748580,
    "num_inconsistent_mappings": 28046150,
    "MSF": 0,
    "OSF": 32448,
    "ISF": 20518131,
    "MSR": 0,
    "OSR": 487250,
    "ISR": 334748580,
    "SF": 1833525,
    "SR": 5088606,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}
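
As a rough check on the library-type warning, the compatible-fragment ratio can be recomputed from the counts in this file. This is only a sketch: the helper name `summarize_lib_format` is hypothetical, and it assumes the ratio is `num_compatible_fragments / num_assigned_fragments` (which matches the numbers shown above) and that the warning corresponds to more than 5% disagreement, as its text suggests.

```python
import json


def summarize_lib_format(counts: dict) -> dict:
    """Recompute the compatible-fragment ratio from lib_format_counts.json fields."""
    assigned = counts["num_assigned_fragments"]
    compatible = counts["num_compatible_fragments"]
    ratio = compatible / assigned if assigned else 0.0
    return {
        "expected_format": counts["expected_format"],
        "compatible_ratio": ratio,
        "incompatible_ratio": 1.0 - ratio,
        # Assumption: the quoted warning corresponds to > 5% disagreement.
        "exceeds_5_percent": (1.0 - ratio) > 0.05,
    }


# Values copied from the file above; in practice you would json.load() the file.
example = json.loads("""
{
  "expected_format": "ISR",
  "num_compatible_fragments": 34584460,
  "num_assigned_fragments": 37423115
}
""")
summary = summarize_lib_format(example)
print(summary)
```

For this sample, roughly 7.6% of the assigned fragments disagree with ISR, which is consistent with the warning being printed.
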
rob-p commented 7 years ago

Hi @atasub,

It's hard to say exactly whether this mapping rate is much lower than expected or not. Many RNA-seq experiments do end up with a mapping rate of 65-70%. One thing that might contribute to a lower mapping rate is reads that are short relative to the minimum required exact match length (default of 31). If your reads are relatively short after trimming (which it looks like you are doing here), say ~50 bp, then you might try lowering the k value with which the index is built. This will allow more sensitive mapping.
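
To judge whether read length is a plausible factor, one could tabulate the lengths of the trimmed reads and see how many fall below k. The sketch below is an illustration, not part of Salmon: the function names are hypothetical, and it assumes plain 4-line FASTQ records (optionally gzipped) and samples only the first `max_reads` reads.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path: str, max_reads: int = 100_000) -> Counter:
    """Count read lengths in the first max_reads records of a FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                lengths[len(line.strip())] += 1
    return lengths


def fraction_shorter_than(lengths: Counter, k: int = 31) -> float:
    """Fraction of sampled reads too short to contain even one k-mer."""
    total = sum(lengths.values())
    short = sum(n for length, n in lengths.items() if length < k)
    return short / total if total else 0.0
```

If a sizeable share of reads is shorter than or only slightly longer than 31 bp, rebuilding the index with a smaller k (the `-k` option of `salmon index`) is the change suggested above.
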

However, the other thing to try is simply to align one of these samples to the genome with a tool like STAR or HISAT2 and look at their mapping rate to known features. If it's similar, then the other reads could be accounted for by e.g. intron retention or even contamination. Finally, @vals has an excellent series of blog posts on investigating and addressing low mapping rates (albeit in single-cell data) that you might find useful. Let me know what you find.

atasub commented 7 years ago

Hi @rob-p, thank you for your reply, it was very helpful. I had used HISAT2 before and got an overall alignment rate of around 97-99%. So, in this case, would you recommend using alignment-based mode with HISAT2-based BAM files?

roryk commented 7 years ago

Almost always when I've seen a low mapping rate to RNA together with a high genomic mapping rate, the culprit is that the sample failed and had little to no RNA in it, and what actually got sequenced was DNA. I'm guessing you'll see a lot of intergenic reads in your HISAT2 alignments.

vals commented 7 years ago

Hi @atasub

If you're using the same reference and gene annotation for HISAT2 and Salmon but getting lower mapping rate with Salmon, you probably have some DNA contamination.

You should get the same gene expression results from either strategy, because in the end the GTF file for the genome and the FASTA file for the transcriptome are equivalent.

hiraksarkar commented 7 years ago

Hi @atasub, this is interesting. Can you tell us a little about the read lengths? I am currently looking into RNA-seq files like this that have bad mapping rates, so I am curious to know a little more. Also, can you try running salmon in selective alignment mode? I'm not sure whether it improves the mapping rate or quantification, but it is worth a try: https://github.com/COMBINE-lab/salmon/tree/selective-alignment. The associated pre-print is here.

roryk commented 7 years ago

Another red flag would be a high rRNA rate going along with it: the rRNA depletion methods don't work 100%, and if you have no mRNA then the rRNA rate will tend to be higher.

vals commented 7 years ago

Yes, that is also a good explanation. I recommend putting human rRNA in the Salmon index.

InesdeSantiago commented 6 years ago

@hiraksarkar How can I use selective mapping in Salmon? I don't see any info in the docs (http://salmon.readthedocs.io/en/latest/salmon.html?highlight=selective).

Say I have paired-end data and I run:

./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq -o transcripts_quant

How can I specify selective mapping over quasi-mapping?

hiraksarkar commented 6 years ago

@InesdeSantiago Sorry that the docs are not updated yet. It's definitely on our to-do list. Selective alignment needs a separate index from the normal quasi-index. The steps are as follows. If you are in the salmon root folder, the most up-to-date branch that implements selective alignment is this branch:

git checkout rescue-orphan
(re-build it)
build/src/salmon index -i selective_alignment_ind -t transcript.fa
salmon quant -i selective_alignment_ind -la -1 reads1.fq -2 reads2.fq -o transcript_quant --softFilter --editDistance 4 --rangeFactorization 4

We strongly recommend these options when using selective alignment, as they almost always produce superior results (I am considering making them the default soon :) ).

Please let me know if you run into problems in any of the above steps, or if the results are not as expected. Thanks again for using selective alignment and Salmon.

InesdeSantiago commented 6 years ago

@hiraksarkar Thanks! So, apart from the extra options (softFilter, editDistance, rangeFactorization) the only difference is the indexed genome file?

hiraksarkar commented 6 years ago

@InesdeSantiago That is correct. Just to be sure: Salmon is not designed for the genome, so you probably want to use it only with a transcriptome.

InesdeSantiago commented 6 years ago

@hiraksarkar Yes, force of habit, I meant the transcriptome! ;-)

RaymondSHANG commented 5 years ago

Hi, I have a similar case, with a 30-40% mapping rate from Salmon. I tried HISAT2, and the mapping rate goes to >80%. I sorted the SAM files into BAM with samtools, and then Qualimap 2 gave me these QC results:

Exonic: 31,212,828 / 41.39%
Intronic: 39,191,136 / 51.97%
Intergenic: 5,008,406 / 6.64%
Intronic/intergenic overlapping exon: 6,243,753 / 8.28%

There is not much DNA contamination, but a large portion of the mappings are intronic. What can I do with these data? Any suggestions?