Quantification of RNA seq reads from the chloroplast

Hi all.

I'm doing a fairly simple RNA Seq experiment right now, but ran into a problem when trying to quantify reads from the chloroplast (A. thaliana). For the entire analysis, I am using the nf-core/rnaseq pipeline (default parameters) which gives me Alignment (STAR) based counts from Salmon at the end and additional counts from FeatureCounts. The whole pipeline runs without problems and the mapping looks good. When I compare the number of reads between Salmon and FeatureCounts, the results are pretty much the same for the nucleus, but for the chloroplast they differ dramatically (e.g. psbI or ATCG00080: Salmon: 24; FeatureCounts: 9277). Looking at the mapping, I would say that FeatureCounts is definitely closer to reality here. When I dig deeper, it seems that the genes with the biggest difference between the two analyses are very small genes (e.g. psbI is only 111bp) or tRNAs (which should be removed by the prep method; these would be excluded from the analysis anyway). I have now tried several things: I used the nf-core/rnaseq pipeline, skipping the alignment part (--skip_alignment) and using Salmon as a pseudo-aligner (--pseudo_aligner: salmon), which gave results for the chloroplast reads that are quite close to the alignment-based determination by Salmon (e.g. psbI: 372). I also tried using the pseudo-alignment without the pipeline (indexing: salmon index -t input.fa -r transcripts_index -k 31 pseudomapping: salmon quant -i transcripts_index -l A -1 sequencingdata_R1.fastq.gz -2 sequencingdata_R2.fastq.gz --validate mapping - o transcripts.quant). The result was that the chloroplast numbers were now roughly between the salmon counts from the pipeline and the counts from the pipeline with FeatureCounts. I skipped the trimming part here, so differences are to be expected, but they still seem very high (e.g. psbI 2677). First, I have no idea where the difference between counts with and without pipelines comes from. I tried to find out what parameters the pipeline uses for Salmon, but I didn't find anything special there. And secondly, I wonder if Salmon has somehow problems with the chloroplast genome (super small, polycistronic and monocistronic units, partially small genes, high expression levels)? I know there are publications where Salmon has been used for chloroplast read counts, but I haven't found any information about their parameters.

I would be super happy if you guys could help me here. I could also just work with FeatureCounts, but I would like to understand my problems.

All the best Florian

Hi @FlorianRNA,

Thanks for reporting this. I think it may also be worthwhile bringing this up with the nf-core people, as they would certainly be interested in hearing about this and understanding the likely source of this kind of discrepancy.

From what you've explained, here is my current hypothesis of what's going on.

Why are the salmon counts much lower for this gene when using alignment-mode and mapping mode under nf-core?
- Though the behavior you observe is similar, I think the cause is somewhat different. In alignment mode, STAR is used for alignment. The alignments are made against the genome and then projected onto the annotated transcriptome. STAR has many internal rules for when an alignment can be successfully projected or not. In this case, STAR limits the number of soft clips it will permit in an alignment that it reports to be valid with respect to the annotated transcriptome. I am guessing that many alignments overhang the end of the annotated transcripts, and so STAR does not project them to the transcriptome and so salmon cannot count them. In mapping mode, the nf-core pipeline makes use of salmon's selective-alignment with decoy sequences. The main purpose of this is to avoid spurious mapping to transcriptomic sequences that may be similar to other unannotated sequences in the genome that are nonetheless a better match for the read (e.g. an unannotated possibly transcribed pseudogene). The way this works in practice is that both the transcript sequences themselves and the full genome are indexed. Any read that aligns strictly better to the genome than the transcriptome is considered to map to a decoy, and is not used for the purposes of quantification. Consistent with the behavior I hypothesized above for STAR, if you have many softclipped bases at the end of the read that nonetheless match what is in the genome downstream of the end of the annotated transcript, you'll likely see these reads assigned as decoys. To check this, you can look at salmon's meta_info.json output file to see how many reads were mapped best to decoys.
Why do I see much higher counts for this gene with FeatureCounts?
- It depends on the specific behavior you invoke. However, my guess is that FeatureCounts is being run with flags such that reads that only somewhat overlap a feature are nonetheless assigned to it. This suggests that while no good alignment may actually exist to the annotated transcript, FeatureCounts is still assigning the read to that feature because it overlaps it to some degree and matches the corresponding location on the genome. Again, you can test this by changing the required overlap fraction of FeatureCounts.
Why does running salmon outside of nf-core produce much higher counts?
- Since you are indexing just the transcriptome, and not including the genome as decoy sequence (as is done in nf-core), then the only thing that will prevent reads from being assigned to the gene in question is if so much of the read overhangs off the end of the annotated transcript that no mapping matches the minimum required alignment score. This is likely to be a much more liberal threshold than what STAR allows, so it also explains why you see higher counts than when alignment mode is used.
Other thoughts / suggestions?
- So, there are several things that you might consider doing if you believe the correct behavior in your case is to assign these reads to such genes. First, when run in mapping mode, salmon has a --softclipOverhangs flag that will further reduce the penalty for reads overhanging the annotated end of a transcript. This will allow more reads to map to the transcript even if they can't obtain a good alignment score. Likewise, you can combine this with further reducing the required minimum score using the --minScoreFraction parameter. Finally, looking forward, we have developed and been testing even more comprehensive solutions to cases when one wants to allow large amounts of soft-clipping (see e.g. this tutorial). While those features have not yet been migrated into the main salmon branch, you may find the tutorial instructive and the corresponding feature branch useful. If you believe that the annotations themselves are incomplete/incorrect and that may be leading to some of this behavior, you might consider augmenting or updating those annotations. Finally, I'd be reticent to just go with FeatureCounts instead here. While the heuristics employed by the overlap and counting rules may accord with what you expect for this gene or some subset of similar genes, the counting based approach implements several heuristics that can be problematic in a number of other scenarios.

Hopefully this helps answer your question about this behavior. If you end up discussing this with the nf-core folks, I'd be happy to be involved in that discussion as well.

Best, Rob

Hi @FlorianRNA ! As stated in the usage docs for the nf-core/rnaseq pipeline:

"Since v3.0 of the pipeline, featureCounts is no longer used to perform gene/transcript quantification, however it is still used to generate QC metrics based on biotype information available within GFF/GTF genome annotation files. This decision was made primarily because of the limitations of featureCounts to appropriately quantify gene expression data. Please see Zhao et al., 2015 and Soneson et al., 2015."

This is a common cause of confusion and I have tried to be as explicit about this in the docs. featureCounts is used to quantify features in the annotation by gene_biotype and not the actual gene / transcript features themselves. This may explain why you are seeing these discrepancies. However, I am still a little puzzled how you are able to directly compare the counts generated by featureCounts and Salmon (in either mode) because the core features that are being quantified should be different. Where did you get the plant reference genome from? If it's not from Ensembl then it probably isn't worth running the biotype quantification with featureCounts anyway because the GTF annotation files may not contain that information. There are some docs for this here.

Hope that helps and if you think we can improve the pipeline in any way please feel free to create an issue on the nf-core/rnaseq repo.

Hello @rob-p,

thank you very much for your quick and detailed answer!

Why are the salmon counts for this gene much lower when using alignment mode and mapping mode under nf-core?

This theory makes perfect sense and would explain the low counts for the short genes perfectly. When I check the bam files, I see quite a few reads that are only partially in the region of the particular feature. I looked at the meta_info.json:

"num_bootstraps": 0,
"num_processed": 36672829,
"num_mapped": 27737862,
"num_decoy_fragments": 2690181,
"num_dovetail_fragments": 59605,
"num_fragments_filtered_vm": 2897142,
"num_alignments_below_threshold_for_mapped_fragments_vm": 4117684,
"percent_mapped": 75.63600288376989,

It looks like the decoy percentage is substantial. I'll run the pseudo-aligment in decoy mode and take a look at the unmapped reads to see if there are many that could be mapped to the chloroplast genome (I'll update this later).

Why am I seeing much higher values for this gene with FeatureCounts?

I have now run FeatureCounts several times with different overlaps (minOverlap =25, minOverlap =50, minOverlap =75min Overlap =100) and indeed the counts have decreased (again the psbI example: 8685 , 6011, 4237, 1805 accordingly). Again, this is a good argument for the hypothesis put forward.

Why does running Salmon outside nf-core lead to much higher values?

Hopefully, after I run Decoy mode, this problem is solved.

I also tried mapping mode with the --softclipOverhangs option. That increased the counts (psbI : 4696 counts); playing around with the --minScoreFraction flag in addition to the --softclipOverhangs flag also increased the numbers ( minScoreFraction= 0 ->psbI = 8496; minScoreFraction= 0.5 ->psbI = 5633; minScoreFraction= 0.7 ->psbI =3627 ).

So, in summary, your explanation seems to be completely correct. In the case that decoy mode resolves the difference between the pipeline and the run outside the pipeline, I would not give this to the nf-core people. But I will if there are still large discrepancies after the run.

I'm still not sure what the best parameters are for my analysis, but the --softclipOverhangs flag seems to be the best option for me now. So thanks again!

@drpatelh

Thank you very much for your quick reply as well. I was a bit inaccurate when I said I used the FeatureCounts from the pipeline. I actually wasn't able to use the resulting .txt files. Instead, I used the resulting bam file from the pipeline to perform a FeatureCounts analysis on R. I hope this information answers the question of how I can compare the two results? My genome and gtf file are from EnsemblPlants, so they should be fine. In the MultiQC file, the vast majority of reads align to protein coding regions according to FeatureCounts, so I hope my primary files are fine.

Thanks again for your help and time!

All the best Florian

Ah, my bad. I assumed you were running featureCounts via the pipeline. Thanks for clarifying.

Happy to incorporate changes into the pipeline in the future if they improve the default behaviour. For now, you can tweak the settings you provide the pipeline to incorporate the --softclipOverhangs parameter. You can put the snippet below in a file called custom.config and pass to the pipeline on the CLI with -c custom.config:

process {
    withName: '.*:QUANTIFY_SALMON:SALMON_QUANT' {
        ext.args = '--softclipOverhangs'
    }
}

Let me know if you have any problems with this.

COMBINE-lab / salmon

Quantification of RNA seq reads from the chloroplast #798