COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Questions about salmon mapping rate #482

Closed lauraht closed 4 years ago

lauraht commented 4 years ago

Hi Rob,

I have a few questions regarding salmon mapping rate and would appreciate your advice.

(1) Generally, is the mapping rate with selective alignment expected to be lower than with the previous quasi-mapping?

I am currently using salmon 1.1.0. I re-built the transcriptome index using v1.1.0 with the default options (I did not use the --decoys option). I have a few datasets on which I had run the older salmon version 0.9.1 before, so I can compare their mapping rates with the current version. For both versions, I used the default options when running salmon quant (using “-l A”). The following is the comparison of mapping rates:

| Dataset | v0.9.1 | v1.1.0 |
| --- | --- | --- |
| ERR1586786 | 59.1893% | 59.1338% |
| ERR2399747 | 59.5182% | 58.2916% |
| SRR9698990 | 67.3904% | 57.9677% |
| SRR10883703 | 40.4469% | 40.5271% |

Except for the last dataset (in which the new version has a slightly higher mapping rate than the old version), the new version has lower mapping rates than the old version. For dataset SRR9698990, the mapping rate of the new version is considerably lower than that of the old version.

Since v1.1.0 uses selective alignment by default while v0.9.1 uses quasi-mapping, I was wondering whether it is expected behavior that the mapping rate with selective alignment (without using --decoys in the index) is generally lower than with quasi-mapping?
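For reference, the indexing and quantification commands were essentially the defaults, along the lines of the following sketch (file and directory names here are placeholders rather than my exact paths):

    # build the transcriptome index with salmon v1.1.0 defaults (no --decoys)
    salmon index -t Homo_sapiens.GRCh38.cdna.all.fa -i salmon_index

    # quantify with automatic library-type detection (-l A); paired-end example
    salmon quant -i salmon_index -l A \
        -1 sample_1.fastq.gz -2 sample_2.fastq.gz \
        -p 8 -o quants/sample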

(2) I ran the new salmon version v1.1.0 on a set of SRA datasets (fastq). The resulting mapping rates span the full range (from 0.0001% to 99%), and some datasets did not get any mapping at all. For example, the following datasets got extremely low mapping rates:

| Dataset | Mapping rate |
| --- | --- |
| SRR9007475 | 0.00208642% |
| SRR5866113 | 0.0171606% |
| SRR448056 | 0.0524797% |
| SRR1535539 | 0.00304504% |
| SRR3129941 | 0.000626134% |

And the following datasets did not get any mapping (“salmon was only able to assign 0 fragments to transcripts in the index”): SRR764657 SRR067901 SRR2913241 SRR1182629

Those are all human RNA-seq datasets sequenced on Illumina. I used the GRCh38 transcriptome (Homo_sapiens.GRCh38.cdna.all.fa) from Ensembl (without using --decoys in the index).

I was wondering what may be the possible reasons for those datasets to have minimal or zero mapping rates?

(3) I plan to use the salmon quantification results for this set of SRA datasets to obtain gene-level expression estimates (via tximport). Since the mapping rates span the full range, with a considerable number of datasets having quite low rates, I was wondering roughly what mapping rate a dataset should have before it can be considered “useful” for estimating gene expression (based on your experience)? In other words, what mapping-rate threshold would you use to filter out “not useful” datasets? I was considering a 40% mapping rate as a possible threshold.

Thank you very much for your help!

rob-p commented 4 years ago

Hi @lauraht,

Thanks for the detailed question (with data!). I will try to answer these in order.

(1) The mapping rate of selective alignment is generally expected to be similar to that of quasi-mapping, but there are some important exceptions. You can find some aggregate statistics in supplementary figure 1 of the pre-print that introduces selective alignment.

[Supplementary figure 1 from the selective-alignment pre-print: per-sample comparison of the number of reads mapped by each method against the oracle.]

Here, "oracle" is a method that aligns the reads both to the transcriptome with Bowtie2 and to the genome with STAR, and removes reads from the Bowtie2 BAM that seem to be spuriously aligned to the transcriptome (i.e., they align to the genome, outside of any annotated transcript, with a better score than the one Bowtie2 assigned within the transcriptome). You can see that in most samples the methods map a similar number of reads, but there are definitely samples where some methods map more reads than the oracle, and sometimes quasi-mapping maps quite a few more. This is, to a large extent, because quasi-mapping doesn't validate those mappings, and some of them may be spurious (i.e., the exact matches used to find the mapping at the given location would not support a high-quality alignment at that location).

(2) It is certainly possible that some samples get very little to no mapping. However, there are a few points about how the data are processed that are worth being aware of before you write such samples off. For example, the meta_info.json file in the quant output records counts such as num_dovetail_fragments, num_decoy_fragments, and the number of alignments filtered out for falling below the score threshold, which can help explain where the unmapped reads are going.

(3) This is a great question, and I don't have a "standard" practice for such things. Generally, I might consider setting a threshold based on the number of reads that could be mapped rather than on the relative mapping rate itself. For example, if you have a really large sample, even a lower mapping rate may be OK if the total number of mapped reads is respectable. However, "respectable" is a weasel word here, and I don't have a specific suggestion as to what number would be ideal. I don't think that filtering on mapping percentage is a bad idea either, and in that case, requiring at least a 30-40% mapping rate could certainly be argued to be a reasonable QC metric.
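If it helps, one way to apply that kind of absolute-count filter is to read num_mapped out of each sample's meta_info.json (salmon writes it under aux_info/ in the quant directory). The sketch below assumes that layout; the directory names and the 1M cutoff are purely illustrative, not a recommendation:

    # keep only samples with at least MIN_MAPPED mapped fragments (illustrative cutoff)
    MIN_MAPPED=1000000
    for d in quants/*/; do
        n=$(jq '.num_mapped' "$d/aux_info/meta_info.json")
        if [ "$n" -ge "$MIN_MAPPED" ]; then
            echo "PASS  $d ($n mapped fragments)"
        else
            echo "FAIL  $d ($n mapped fragments)"
        fi
    done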

Please let me know if you dig into any of the above (specifically the points raised in (2)) and find anything interesting or would like to discuss further.

Best, Rob

lauraht commented 4 years ago

Hi Rob,

Thank you so much for your very thorough and comprehensive explanations (with statistics) and advice! I really appreciate it!

I have looked into the meta_info.json files for these datasets. It looks like none of them has dovetail fragments. The following is the relevant information from meta_info.json for each dataset:

For the 5 datasets with minimal mapping rates:

SRR9007475:

    "num_processed": 64991581,
    "num_mapped": 1356,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 1272836,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 3823916,

SRR5866113:

    "num_processed": 8065000,
    "num_mapped": 1384,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 5154685,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 165001284,

SRR448056:

    "num_processed": 13530942,
    "num_mapped": 7101,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 54,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 14420,

SRR1535539:

    "num_processed": 8045873,
    "num_mapped": 245,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 111743,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 593521,

SRR3129941:

    "num_processed": 57495682,
    "num_mapped": 360,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 358659,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 6631423,

For the 4 datasets with 0 mapping rate (they do not have other info besides "num_processed" and "num_mapped"):

SRR764657:

    "num_processed": 28342632,
    "num_mapped": 0,

SRR067901:

    "num_processed": 3571366,
    "num_mapped": 0,

SRR2913241:

    "num_processed": 40070874,
    "num_mapped": 8,

SRR1182629:

    "num_processed": 15381872,
    "num_mapped": 0,

I will dig into the other possibilities that you suggested when I get time, and post the update.

Thank you very much for all your help!

rob-p commented 4 years ago

Hi @lauraht,

So I decided to explore just one of these to see if I could figure out what might be going on. The results below are for SRR9007475. First, even though I processed the data with the latest version of the develop branch (which will become 1.2.0), I got essentially identical results to what you reported. Simply aligning the data against an index built on a human Gencode v26 transcriptome (with no decoys) gives me a mapping rate of 0.00378202832148367%.

The first thing I did was to quality- and adapter-trim the data (using fastp -i SRR9007475.fastq.gz -o SRR9007475_trimmed.fastq.gz -q 10 -w 8) and ... whoa. This is the fastp HTML report: fastp.html.zip. The first astounding statistic: the mean read length before trimming is 51bp (these are relatively short single-end reads), and the mean read length after trimming is 21bp! So the average read length is, in fact, less than the k-mer length used for indexing (the default is k=31). On the trimmed data, the mapping rate goes up to 2.3545475882931305%, still very low, but now there's at least a partial explanation: the average read is shorter than a single k-mer.

So, the next thing I tried was indexing with a smaller k; a really small one in this case, k=15. Then I re-ran on the trimmed reads (the fact that trimming took us from 51bp down to 21bp suggests that the reads had a lot of low-quality bases, adapter contamination, or both). Under this setting the mapping rate was still low, but much higher than before: 16.766993524863488%.
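For concreteness, the small-k index was built roughly as follows (the FASTA name is a placeholder for the Gencode v26 transcriptome mentioned above):

    # rebuild the index with a much smaller k-mer size to accommodate very short reads
    salmon index -t gencode.v26.transcripts.fa -i gencode_v26_k15 -k 15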

The final thing I tried was seeing how the mapping rate changed as I altered --minScoreFraction, the salmon parameter that determines the alignment score a read must achieve in order to be considered validly mapped. The default is 0.65, meaning the read cannot have a score < 0.65 * the maximum achievable score for the read given its length. In the case of a 21bp read, the best achievable score would be 42, so a read must obtain a score >= 27 in order to be mapped. That already corresponds to a pretty poor mapping, but I reduced the fraction even further to 0.3 (so any read with a score > 12 would pass). This led to a mapping rate of ~46%. However, at this point, I'm not sure I would be confident in such mappings. For example, the situation here would be a 21bp read with multiple mismatches and, much of the time, one or more indels.
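Concretely, that last run looked roughly like this (single-end input via -r, using the trimmed reads and the k=15 index; names are placeholders):

    # permissive run: accept alignments scoring >= 30% of the best possible score
    salmon quant -i gencode_v26_k15 -l A -r SRR9007475_trimmed.fastq.gz \
        --minScoreFraction 0.3 -p 8 -o SRR9007475_msf03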

So, my conclusion, at least on this sample, is that the main issue is data quality. Trimming the reads and indexing with a smaller k can lead to a mapping rate of ~16%, and allowing really bad alignments can take it up to ~46% (and even higher: when I set --minScoreFraction to 0.1, I get a mapping rate of 57%). But the level of confidence one can derive from poorly-aligned 21bp reads is (and probably should be) quite low. I can't say, of course, that this is the situation with the other samples, as I've not looked at them. However, for this sample, and probably for some or all of the others, there is a data-quality issue. So perhaps the first order of business should be deciding what level of data quality is worth trying to extract a signal from, and where the line is for determining whether a sample is worth the trouble.

lauraht commented 4 years ago

Hi Rob,

You actually did a very thorough investigation on this dataset for me! I greatly appreciate it!

From what you found out, it looks like salmon’s mapping rate (with default parameters) could be a quite reasonable indicator of the dataset’s quality. This is very helpful information.

Thank you so much again for your time and help!

rob-p commented 4 years ago

Hi @lauraht,

> From what you found out, it looks like salmon’s mapping rate (with default parameters) could be a quite reasonable indicator of the dataset’s quality.

Yes, I think so too. The one small modification I might put here is "salmon’s mapping rate (with default parameters, and after trimming)". Otherwise, I think this makes sense, as salmon's alignment procedure will try quite hard to find a suitable alignment for the read if one exists.

If there's no objection, I'll close this issue as I think the main question has been resolved. However, if you have a related question, or run into something else, please feel free to open up a new issue.