COMBINE-lab / salmon

šŸŸ šŸ£ šŸ± Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

ISSUE: Number of genes detected in a single cell #120

Closed kobeho24 closed 7 years ago

kobeho24 commented 7 years ago

Hi everyone, Lately I was trying to use Salmon (v0.8.0) along with tximport to study a downloaded single-cell dataset at the gene level. And I came across something weird: I found almost 20k genes on average per cell, which is way higher than expected. Realistically, the scRNA-seq protocol I followed should only detect roughly 10k genes. I redid the analysis with STAR + featureCounts and observed about 6k genes per cell on average. I wonder which part of my code went wrong. Attached please find my code for Salmon and the subsequent R script for tximport. Any advice or suggestions will be much appreciated! I do love Salmon for its speed and convenience.

Salmon index

#!/bin/bash

salmon index -t /home1/garyhe/workingdir/ref/gencode/gencode.v25.transcripts.ercc.fa -i /home1/garyhe/workingdir/ref/index/salmonindex/v25/19mer --gencode --type quasi -k 19

Salmon quant

#!/bin/bash

cd /home1/garyhe/workingdir/data/bjorklund2016ni/00_raw

for i in $(ls *.gz | cut -c 1-10 | uniq)
do
    salmon quant -i /home1/garyhe/workingdir/ref/index/salmonindex/v25/19mer \
        -l U \
        -r ${i}_1.fastq.gz \
        --writeMappings=/home1/garyhe/workingdir/data/bjorklund2016ni/01_aligned/${i}.sam \
        -o /home1/garyhe/workingdir/data/bjorklund2016ni/02_quant/${i}
done

R script for tximport

Condense the Ensembl transcript ID counts to gene ID counts

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("./Sequencing_run/gencode.vM12.primary_assembly.annotation.ercc.gtf")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
head(tx2gene)

library(tximport)
library(readr)
dir <- "./Sequencing_run/salmon_output/sc/"
list.files(dir)
samples <- read.table("./Sequencing_run/salmon_output/scsampleinfo.txt", header = TRUE)
samples
files <- file.path(dir, samples$Sample_ID, "quant.sf")
names(files) <- paste0(samples$SampleID)
names(files) <- gsub("[::].*$", "", names(files))

gene-level

txi.salmon.g <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv)
names(txi.salmon.g)
head(txi.salmon.g$counts)
gcounts.txt <- txi.salmon.g$counts
write.csv(gcounts.txt, "./Sequencing_run/salmon_output/sc/gcounts.csv")

Best!

vals commented 7 years ago

Hey,

What is your cutoff for considering a gene expressed? Salmon puts a small > 0 expression value on most genes. Usually we count genes with e.g. > 5 reads, or e.g. > 1 TPM, depending on the analysis.
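As a concrete sketch of applying such a cutoff at the transcript level, the snippet below counts quant.sf entries with NumReads >= 5. The file contents here are invented mock values (not from this dataset); the column layout (Name, Length, EffectiveLength, TPM, NumReads) matches Salmon's transcript-level output, but it's worth verifying against your own quant.sf.

```shell
# Create a tiny mock quant.sf (mock values; column order as in Salmon output).
cat > quant.sf <<'EOF'
Name	Length	EffectiveLength	TPM	NumReads
ENST0001	1500	1300	12.5	42
ENST0002	900	700	0.03	0.7
ENST0003	2100	1900	1.8	6
EOF

# Count entries passing a NumReads >= 5 cutoff, skipping the header line.
awk -F'\t' 'NR > 1 && $5 >= 5 {n++} END {print n}' quant.sf
```

For this mock file the command prints 2 (only the rows with 42 and 6 reads pass); the same one-liner with `$4 >= 1` would apply a TPM cutoff instead.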

kobeho24 commented 7 years ago

Hi vals, The cutoffs for Salmon as well as STAR + featureCounts/RSEM are all > 0, whether for normalized values (RPKM, TPM) or raw counts. To my knowledge, there shouldn't be a huge difference between pipelines in the number of detected genes. Somehow, I think Salmon is over-sensitive to some extent. It's good to know that most genes will get a small > 0 expression value. That makes things clear~

Best! Gary

vals commented 7 years ago

Hey Gary,

It's not really overly sensitive; it's more that changes on the level of 0.01 reads or 0.01 TPM are within the range of quantification uncertainty. A gene being, e.g., 0.1 TPM different between samples isn't meaningful. And all genes get some small amount of expression.

Do you get reasonable gene counts if you count genes with > 1 counts instead of > 0 counts?
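For a gene-by-cell matrix like the gcounts.csv written by the R script above, per-cell detected-gene numbers at different cutoffs can be tallied with a small awk pass. The matrix below is a mock with invented values, in the layout `write.csv` produces (gene IDs in column 1, one column per cell):

```shell
# Mock gene-by-cell counts matrix (invented values, write.csv-style layout).
cat > gcounts.csv <<'EOF'
"","cell1","cell2"
"GeneA",0.2,7
"GeneB",10,0
"GeneC",3,15
EOF

# Per-cell detected-gene counts at a > 0 cutoff versus a >= 5 cutoff.
awk -F',' 'NR > 1 {
    for (c = 2; c <= NF; c++) {
        if ($c > 0)  pos[c]++
        if ($c >= 5) strict[c]++
    }
} END {
    for (c = 2; c <= NF; c++)
        printf "cell%d: >0 -> %d, >=5 -> %d\n", c - 1, pos[c] + 0, strict[c] + 0
}' gcounts.csv
```

On the mock matrix this prints `cell1: >0 -> 3, >=5 -> 1` and `cell2: >0 -> 2, >=5 -> 2` — illustrating how strongly a > 0 cutoff inflates detected-gene counts when many genes carry tiny nonzero estimates.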

kobeho24 commented 7 years ago

Hi vals, Interestingly, I just did the gene-number summary for both Salmon and featureCounts, with a cutoff of counts >= 5. Still, Salmon reports roughly a 2-fold higher number of detected genes. See the links below: Salmon: https://flic.kr/p/QRZhVq featureCounts: https://flic.kr/p/RXw7FK I was wondering if my Salmon code attached above was wrong.

Best! Gary

vals commented 7 years ago

Hi Gary,

Oh that's quite weird!

How do you deal with multimapping reads in the featureCounts pipeline? I haven't looked at this particular data set, but the Sandberg lab usually have very short (single end) reads, which I suspect will multimap a lot?

If you are removing multimapping reads in the featureCounts pipeline before counting, this might explain things, since any genes with ambiguously mappable sequences won't be counted. (In particular, the Sandberg lab tend to use genome references with ambiguous regions removed.)

kobeho24 commented 7 years ago

Hi vals, Yes, the reads are single-end 43 bp, which is very short! So I set k=19 when building the Salmon index for quasi-mapping.

My featureCounts script is as follows.

#!/bin/bash

GTF=/home1/garyhe/workingdir/ref/gencode/gencode.v25.primary_assembly.annotation.ercc.gtf

cd /home1/garyhe/workingdir/data/bjorklund2016ni/01_aligned

featureCounts -M -T 24 -a $GTF -o /home1/garyhe/workingdir/data/bjorklund2016ni/02_quant/ilc.unprocessed.matrix.txt *Aligned.sortedByCoord.out.bam

cd /home1/garyhe/workingdir/data/bjorklund2016ni/02_quant/

cat ilc.unprocessed.matrix.txt | cut -f1,7- | sed 1d > ilc.matrix.txt

Actually, multimapped reads were all taken into account with the -M option.

And as a matter of fact, gene counting without multimapped reads was also done. The number of detected genes (cutoff = 5) was even lower :( https://flic.kr/p/S9sPp4

vals commented 7 years ago

Hi Gary,

Ok, I'm out of ideas...

Does the number of genes per cell at least correlate between the pipelines?

kobeho24 commented 7 years ago

Hi vals, I don't know the correlation between the Salmon and featureCounts pipelines. However, I've just done similar counting with htseq-count (-m union). The resulting gene numbers are similar to (just slightly higher than) featureCounts (w/o multimapped reads). https://flic.kr/p/RXy39z As far as I can tell, the count-based methods perform similarly. The number of detected genes is within expectation.

rob-p commented 7 years ago

Hi Gary,

Two ideas to try. What do the Salmon numbers look like if you use a larger k (instead of 19)? Also, what does it look like for one of these samples if you run Salmon in alignment mode?

Edit: It also makes sense to compare the number of reads Salmon maps (which is in aux_info/meta_info.json) to what e.g. STAR maps.
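One way to pull the mapped-read count out of meta_info.json without extra tooling is a grep/sed one-liner. The JSON below is a minimal mock with invented numbers; the field names (`num_processed`, `num_mapped`) match recent Salmon releases, but check them against your own aux_info/meta_info.json.

```shell
# Minimal mock of Salmon's aux_info/meta_info.json (invented numbers;
# only the fields used here, field names worth verifying locally).
mkdir -p quantdir/aux_info
cat > quantdir/aux_info/meta_info.json <<'EOF'
{
    "num_processed": 1000000,
    "num_mapped": 820000,
    "percent_mapped": 82.0
}
EOF

# Extract the mapped-read count by stripping everything but the digits.
grep '"num_mapped"' quantdir/aux_info/meta_info.json | sed 's/[^0-9]//g'
```

For the mock file this prints 820000; comparing that number against STAR's uniquely plus multi-mapped read counts from its Log.final.out would show whether the two pipelines even see a similar number of usable reads.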

rob-p commented 7 years ago

@kobeho24, Any update on this? Have you had a chance to try any of the suggestions?

kobeho24 commented 7 years ago

Hi Rob, I haven't had a chance to test whether the suggestions help or not. But I guess it's due to the fact that the reads are super short and single-end only, thus introducing a lot of ambiguity in the quasi-mapping step. Anyway, I will try to test it later.

kobeho24 commented 7 years ago

Hi Rob and vals, I re-compared the number of detected genes on a new dataset (PE100, 96 cells, N. Wilson 2015 Cell Stem Cell) with featureCounts, Salmon, and Kallisto. All three give similar numbers of detected genes, with the k-mer-based methods producing slightly higher numbers. I guess that's mainly down to whether multimapping reads are taken into account. Anyway, it confirms my guess that the issue described above may just be because quasi-mapping is not so friendly with short single-end reads, which may introduce a lot of ambiguity in the quantification process. Since my own datasets won't be short and single-end, I suppose I will keep using Salmon to analyze my data in the future. Thanks for your contributions and help with this issue.

vals commented 7 years ago

Hey Gary, that's good to know :)

Your work comparing these things is really useful. Have you considered putting it up somewhere public? I get a lot of queries from people on whether read length / paired reads matter, and having plots to point people at would be great.