alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

cellranger retrieves more counts than STARsolo #813

Closed tilofrei closed 4 years ago

tilofrei commented 4 years ago

Hi,

Comparing the count matrices of cellranger 3.1.0 and star 2.7.3a for 10x v3 single cell data I noticed that star retrieved 3.8% less counts than cellranger.

I restricted the count matrices to the genes and barcodes found in cellrangers "filtered" output and plotted the number of UMIs per barcode between the two tools: comparison In total counts cellranger yielded 53807309 and star yielded 51718452 counts (-3.8%).

I used the following commands:

  1. cellranger: cellranger count --id sample1 --sample=sample1 --fastqs=./sample1 --transcriptome=$CELLRANGER_REF300/GRCh38 --localcores=$SLURM_CPUS_PER_TASK --localmem=34

  2. STAR: STAR --quantMode GeneCounts --soloType CB_UMI_Simple --soloCBwhitelist ./3M-february-2018.txt --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 --soloBarcodeReadLength 1 --soloFeatures Gene Velocyto --soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI CR UR CB UB GX GN --runThreadN 16 --genomeDir /fdb/STAR_indices/2.7.3a/GENCODE/Gencode_human/release_27/ref --sjdbGTFfile /fdb/STAR_indices/2.7.3a/GENCODE/Gencode_human/release_27/genes.gtf --readFilesCommand zcat --readFilesPrefix ./sample1/ --readFilesIn sample1_S4_L004_R2_001.fastq.gz,sample1_S4_L005_R2_001.fastq.gz sample1_S4_L004_R1_001.fastq.gz,sample1_S4_L005_R1_001.fastq.gz --outTmpDir=/lscratch/$SLURM_JOB_ID/STARtmp --outFileNamePrefix star_out/sample1/

The reference genome of cellranger is "pre-built human (hg19, GRCh38)" and of star is "GENCODE Human Release 27 (GRCh38.p10)".

Would you know if I made a mistake in my comparison or why STARsolo gets less counts than cellranger? I went through all github issues trying to find the parameters that make them match. If there is a good explanation that the skipped reads were "faulty" I wouldn't mind to get less actually.

Thanks for providing this great package!

Best Tilo

alexdobin commented 4 years ago

Hi Tilo,

I think the most likely explanation is in the difference of annotations (GTF). CellRanger uses an abridged version of Ensembl annotations, which removes a large number of genes (transcripts). In particular, most pseudogenes are removed, which may increase the counts towards the normal genes.

You could use the CellRanger GTF with STARsolo, which should yield a very good agreement. And for an almost perfect match you would need to specify --genomeSAsparseD 3 at the genome generation step.

Cheers Alex

tilofrei commented 4 years ago

Thanks for the insight - very helpful! I think that makes sense and I'll keep the gencode ref's for now. (Just feeding the cellranger GTF to STAR lead to abnormally low counts and switching ref+GTF both, generated an error from STAR that the annotation is outdated.) Best Tilo

tilofrei commented 4 years ago

For reference this is the script I used to compare the counts before:

library("Seurat")
library("ggplot2")

# load data
#1
data_dir1 = './filtered_feature_bc_matrix/'
expression_matrix1 = Seurat::Read10X(data.dir = data_dir1)
bc1=colnames(expression_matrix1)
ft1=rownames(expression_matrix1)
#2
data_dir2 = './Solo.out/Gene/raw/'
expression_matrix2 = Seurat::Read10X(data.dir = data_dir2)
bc2=colnames(expression_matrix2)
ft2=rownames(expression_matrix2)

# find common genes/barcodes
bc = intersect(bc1,bc2)
ft = intersect(ft1,ft2)

# plot barcodes and their nUMIs
genes1 = rowSums(expression_matrix1[ft,bc])
genes2 = rowSums(expression_matrix2[ft,bc])
rows_data = data.frame(cbind(genes1, genes2))

ggplot(rows_data, aes(x=genes1, y=genes2)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)

# get total counts
sum1 = sum(expression_matrix1[ft,bc])
sum2 = sum(expression_matrix2[ft,bc])
print(sum1 - sum2) # difference