COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
751 stars 159 forks source link

Quantification of RNA seq reads from the chloroplast #798

Open FlorianRNA opened 1 year ago

FlorianRNA commented 1 year ago

Hi all.

I'm doing a fairly simple RNA Seq experiment right now, but ran into a problem when trying to quantify reads from the chloroplast (A. thaliana). For the entire analysis, I am using the nf-core/rnaseq pipeline (default parameters) which gives me Alignment (STAR) based counts from Salmon at the end and additional counts from FeatureCounts. The whole pipeline runs without problems and the mapping looks good. When I compare the number of reads between Salmon and FeatureCounts, the results are pretty much the same for the nucleus, but for the chloroplast they differ dramatically (e.g. psbI or ATCG00080: Salmon: 24; FeatureCounts: 9277). Looking at the mapping, I would say that FeatureCounts is definitely closer to reality here. When I dig deeper, it seems that the genes with the biggest difference between the two analyses are very small genes (e.g. psbI is only 111bp) or tRNAs (which should be removed by the prep method; these would be excluded from the analysis anyway). I have now tried several things: I used the nf-core/rnaseq pipeline, skipping the alignment part (--skip_alignment) and using Salmon as a pseudo-aligner (--pseudo_aligner: salmon), which gave results for the chloroplast reads that are quite close to the alignment-based determination by Salmon (e.g. psbI: 372). I also tried using the pseudo-alignment without the pipeline (indexing: salmon index -t input.fa -r transcripts_index -k 31 pseudomapping: salmon quant -i transcripts_index -l A -1 sequencingdata_R1.fastq.gz -2 sequencingdata_R2.fastq.gz --validate mapping - o transcripts.quant). The result was that the chloroplast numbers were now roughly between the salmon counts from the pipeline and the counts from the pipeline with FeatureCounts. I skipped the trimming part here, so differences are to be expected, but they still seem very high (e.g. psbI 2677). First, I have no idea where the difference between counts with and without pipelines comes from. I tried to find out what parameters the pipeline uses for Salmon, but I didn't find anything special there. And secondly, I wonder if Salmon has somehow problems with the chloroplast genome (super small, polycistronic and monocistronic units, partially small genes, high expression levels)? I know there are publications where Salmon has been used for chloroplast read counts, but I haven't found any information about their parameters.

I would be super happy if you guys could help me here. I could also just work with FeatureCounts, but I would like to understand my problems.

All the best Florian

rob-p commented 1 year ago

Hi @FlorianRNA,

Thanks for reporting this. I think it may also be worthwhile bringing this up with the nf-core people, as they would certainly be interested in hearing about this and understanding the likely source of this kind of discrepancy.

From what you've explained, here is my current hypothesis of what's going on.

Hopefully this helps answer your question about this behavior. If you end up discussing this with the nf-core folks, I'd be happy to be involved in that discussion as well.

Best, Rob

drpatelh commented 1 year ago

Hi @FlorianRNA ! As stated in the usage docs for the nf-core/rnaseq pipeline:

"Since v3.0 of the pipeline, featureCounts is no longer used to perform gene/transcript quantification, however it is still used to generate QC metrics based on biotype information available within GFF/GTF genome annotation files. This decision was made primarily because of the limitations of featureCounts to appropriately quantify gene expression data. Please see Zhao et al., 2015 and Soneson et al., 2015."

This is a common cause of confusion and I have tried to be as explicit about this in the docs. featureCounts is used to quantify features in the annotation by gene_biotype and not the actual gene / transcript features themselves. This may explain why you are seeing these discrepancies. However, I am still a little puzzled how you are able to directly compare the counts generated by featureCounts and Salmon (in either mode) because the core features that are being quantified should be different. Where did you get the plant reference genome from? If it's not from Ensembl then it probably isn't worth running the biotype quantification with featureCounts anyway because the GTF annotation files may not contain that information. There are some docs for this here.

Hope that helps and if you think we can improve the pipeline in any way please feel free to create an issue on the nf-core/rnaseq repo.

FlorianRNA commented 1 year ago

Hello @rob-p,

thank you very much for your quick and detailed answer!

Why are the salmon counts for this gene much lower when using alignment mode and mapping mode under nf-core?

This theory makes perfect sense and would explain the low counts for the short genes perfectly. When I check the bam files, I see quite a few reads that are only partially in the region of the particular feature. I looked at the meta_info.json:

"num_bootstraps": 0,
"num_processed": 36672829,
"num_mapped": 27737862,
"num_decoy_fragments": 2690181,
"num_dovetail_fragments": 59605,
"num_fragments_filtered_vm": 2897142,
"num_alignments_below_threshold_for_mapped_fragments_vm": 4117684,
"percent_mapped": 75.63600288376989,

It looks like the decoy percentage is substantial. I'll run the pseudo-aligment in decoy mode and take a look at the unmapped reads to see if there are many that could be mapped to the chloroplast genome (I'll update this later).

Why am I seeing much higher values for this gene with FeatureCounts?

I have now run FeatureCounts several times with different overlaps (minOverlap =25, minOverlap =50, minOverlap =75min Overlap =100) and indeed the counts have decreased (again the psbI example: 8685 , 6011, 4237, 1805 accordingly). Again, this is a good argument for the hypothesis put forward.

Why does running Salmon outside nf-core lead to much higher values?

Hopefully, after I run Decoy mode, this problem is solved.

I also tried mapping mode with the --softclipOverhangs option. That increased the counts (psbI : 4696 counts); playing around with the --minScoreFraction flag in addition to the --softclipOverhangs flag also increased the numbers ( minScoreFraction= 0 ->psbI = 8496; minScoreFraction= 0.5 ->psbI = 5633; minScoreFraction= 0.7 ->psbI =3627 ).

So, in summary, your explanation seems to be completely correct. In the case that decoy mode resolves the difference between the pipeline and the run outside the pipeline, I would not give this to the nf-core people. But I will if there are still large discrepancies after the run.

I'm still not sure what the best parameters are for my analysis, but the --softclipOverhangs flag seems to be the best option for me now. So thanks again!

@drpatelh

Thank you very much for your quick reply as well. I was a bit inaccurate when I said I used the FeatureCounts from the pipeline. I actually wasn't able to use the resulting .txt files. Instead, I used the resulting bam file from the pipeline to perform a FeatureCounts analysis on R. I hope this information answers the question of how I can compare the two results? My genome and gtf file are from EnsemblPlants, so they should be fine. In the MultiQC file, the vast majority of reads align to protein coding regions according to FeatureCounts, so I hope my primary files are fine.

Thanks again for your help and time!

All the best Florian

drpatelh commented 1 year ago

Ah, my bad. I assumed you were running featureCounts via the pipeline. Thanks for clarifying.

Happy to incorporate changes into the pipeline in the future if they improve the default behaviour. For now, you can tweak the settings you provide the pipeline to incorporate the --softclipOverhangs parameter. You can put the snippet below in a file called custom.config and pass to the pipeline on the CLI with -c custom.config:

process {
    withName: '.*:QUANTIFY_SALMON:SALMON_QUANT' {
        ext.args = '--softclipOverhangs'
    }
}

Let me know if you have any problems with this.