COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
771 stars 162 forks source link

Query on Salmon mapping parameters to quantify gene abundance in metagenomes #330

Closed Anto007 closed 4 years ago

Anto007 commented 5 years ago

I'm planning on using your wonderful Salmon tool v0.12.0 for generating TPM counts with a view to quantifying relative abundance of certain bacterial antibiotic resistance genes in my shotgun (human gut) metagenomes. So as to ensure strict mappings to the genes of interest, I would like to set the value of the flag '--minScoreFraction' to 0.90. Since its a metagenome with a truck-load of genes from several microbes, I plan to quantify only those genes that show >=90% identity at the nucleotide-level to the known antibiotic resistance genes (of interest). My question here really is whether setting the flag minScoreFraction to 0.90 achieves anything close to what I've in mind? Below is the full command line I used for Salmon-based quantification of the tetracycline resistance gene tetW.

salmon quant --meta -i amr_indices/tetW_index -l A -1 Corr_clean_phiclean_10_8_L001_R1_001.fastq.gz -2 Corr_clean_phiclean_10_8_L001_R2_001.fastq.gz -o 10_tetW_test_quant --mimicStrictBT2 --validateMappings --minScoreFraction=0.90

I would highly appreciate any feedback from you in this regard. Many thanks in advance for your time.

rob-p commented 5 years ago

Hi @Anto007,

Sounds like an interesting experiment! A couple of questions: (1) are you quantifying the meta-transcriptome or the metagenomes? What I mean is, are your target sequences the specific genes from the microbes, or the entire microbial genomes? Is the sequencing data RNA-seq from sequencing the mixture of expressed gene transcripts, or DNA-seq of the microbes? This will have an effect on how you expect reads to be generated.

The effect of --minScoreFraction depends, to some extent, on how you set the match/mismatch/gap parameters. With the default parameters, 0.9 is actually higher than 90% sequence identity, because the default mismatch penalty is twice the match score. If you assume only matches and mismatches, then the --minScoreFraction you want to set is the one such that x (match_score read_length) <= (match_score read_length) - (m read_length match_score) + (m read_length * mismatch_penalty), where m is the mismatch fraction (0.1 in your case). So, for example, if the match_score is 2 and mismatch penalty is -4, and the read length is 100, you want to set it so that:

x 200 = 200 - (0.1 100 2) + (0.1 100 * -4) = 200 - 20 - 40 = 140

so, the appropriate x would be ~0.7. Of course, if you want to make the calculation simple, you can set the match score to 1 and mismatch penalty to 0, and then the interpretation (modulo gaps) is straightforward (and 0.9 means what you want). Finally, I'd typically avoid using --mimicStrictBT2, since those are pretty harsh parameters. Of course, you could try mapping both with and without that flag and see how it affects your mapping rate.

Anto007 commented 5 years ago

Many thanks for your response. I'll try --minScoreFraction=0.70 without the --mimicStrictBT2 flag and see how it goes. To answer to your questions: 1) I'm interested in quantifying the abundance of specific bacterial genes in shotgun metagenomic libraries and hence, 2) the sequencing data is from DNA-seq of gut microbes.

YiJessePi commented 4 years ago

Hey @Anto007, Did you used Salmon to quantify abundance of custom genes in a meta-genomic sample? Do you recommend?

Anto007 commented 4 years ago

Yes @YiJessePi , I quite liked it!

mortonjt commented 3 years ago

Just out of curiosity @Anto007 did you have specific parameters to handle DNA instead of RNA? From what I understand, there are bias parameters that are specific to RNAseq - I would think that this would yield off-putting results if not accounted for, right?

Anto007 commented 3 years ago

Sorry for the late response. According to the developer, the --meta flag is well-suited to metagenomics. Unlike handling RNA data, the use of this flag changes the initialization conditions of the EM algorithm and turns off Salmon's rich equivalence classes so as to better optimize Salmon for handling metagenomic (DNA) data. You can read more here- https://gitter.im/COMBINE-lab/salmon?at=589f11106b2d8dd5522e0ff1

JSSaini commented 1 year ago

@Anto007 Can you please share a publication where you used salmon to quantify gene abundance in metagenomics data? thank you.

jolespin commented 1 month ago

@rob-p can you elaborate on this a bit more:

The effect of --minScoreFraction depends, to some extent, on how you set the match/mismatch/gap parameters. With the default parameters, 0.9 is actually higher than 90% sequence identity, because the default mismatch penalty is twice the match score. If you assume only matches and mismatches, then the --minScoreFraction you want to set is the one such that x (match_score read_length) <= (match_score read_length) - (m read_length match_score) + (m read_length mismatch_penalty), where m is the mismatch fraction (0.1 in your case). So, for example, if the match_score is 2 and mismatch penalty is -4, and the read length is 100, you want to set it so that: x 200 = 200 - (0.1 100 2) + (0.1 100 -4) = 200 - 20 - 40 = 140 so, the appropriate x would be ~0.7. Of course, if you want to make the calculation simple, you can set the match score to 1 and mismatch penalty to 0, and then the interpretation (modulo gaps) is straightforward (and 0.9 means what you want)

Would the two parameter sets mentioned above have the same effect assuming read length 100?

Also, it says Alevin has a default minScoreFraction of 0.87. Would it be safe to assume differentiating between isoforms with Alevin is a similar problem to differentiating between orthologous genes in metagenomics/transcriptomics?

Which parameters would be relevant to control for this?