alexdobin / STAR

RNA-seq aligner
MIT License

estimated cell number very different from cellranger.v5 #1381

Open Anny225225 opened 3 years ago

Anny225225 commented 3 years ago

Hello Alex,

I have a 10x sample for which the mapping/counting results from cellranger and STARsolo are very different.

cellranger count with default parameters estimated about 9400 cells, while STARsolo estimated about 3600, and I am confused by the large difference. In the cellranger results there are two big cell clusters with very high UMI counts (>25k), while the remaining clusters have UMI counts <5k. The STARsolo results have a more even UMI distribution across clusters. However, the median UMI count per cell is 3577 in cellranger and 16493 in STARsolo, which I think is very high.

I wonder whether STARsolo uses a doublet filter for cell calling. What is the reason for the difference?

Thank you!

Details:

starsolo:

STAR --genomeDir starsolo \
    --soloType CB_UMI_Simple --soloCBwhitelist 10x_V3_whitelist.txt --soloUMIlen 12 \
    --readFilesIn \
    ${wd}/2270183_P7_2_S2_L001_R2_001.fastq.gz,${wd}/2270183_P7_2_S2_L002_R2_001.fastq.gz \
    ${wd}/2270183_P7_2_S2_L001_R1_001.fastq.gz,${wd}/2270183_P7_2_S2_L002_R1_001.fastq.gz \
    --runThreadN 20 --outFileNamePrefix s1 \
    --outSAMtype BAM SortedByCoordinate --outReadsUnmapped elp1_s2_Unmapped \
    --twopassMode Basic --chimSegmentMin 20 \
    --readFilesCommand zcat --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
    --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR

Barcodes.stats:
nNoAdapter 0
nNoUMI 0
nNoCB 0
nNinCB 0
nNinUMI 6289
nUMIhomopolymer 286950
nTooMany 0
nNoMatch 21844303
nMismatchesInMultCB 0
nExactMatch 887950279
nMismatchOneWL 5268661
nMismatchToMultWL 19035523

Features.stats:
nUnmapped 89644891
nNoFeature 373085155
nAmbigFeature 13249278
nAmbigFeatureMultimap 11984809
nTooMany 1363653
nNoExactMatch 0
nExactMatch 427453001
nMatch 434911486
nCellBarcodes 2149182
nUMIs 131380805

Summary.csv:
Number of Reads,934392005
Reads With Valid Barcodes,0.974849
Sequencing Saturation,0.697914
Q30 Bases in CB+UMI,0.948347
Q30 Bases in RNA read,0.923572
Reads Mapped to Genome: Unique+Multiple,0.898433
Reads Mapped to Genome: Unique,0.756247
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.479628
Reads Mapped to Transcriptome: Unique Genes,0.465449
Estimated Number of Cells,3618
Reads in Cells Mapped to Unique Genes,239834914
Fraction of Reads in Cells,0.551457
Mean Reads per Cell,66289
Median Reads per Cell,59441
UMIs in Cells,67707559
Mean UMI per Cell,18714
Median UMI per Cell,16493
Mean Genes per Cell,4247
Median Genes per Cell,4175
Total Genes Detected,23201

cellranger:

$cellranger count --id=s1 --transcriptome=refdata-gex-mm10-2020-A --fastqs=2-1649641 --localcores=20 --localmem=300

Estimated Number of Cells | 9449
Mean Reads per Cell | 98887
Median Genes per Cell | 1583
Number of Reads | 934392005
Valid Barcodes | 97.10%
Sequencing Saturation | 69.90%
Q30 Bases in Barcode | 94.90%
Q30 Bases in RNA Read | 92.40%
Q30 Bases in UMI | 94.70%
Reads Mapped to Genome | 90.00%
Reads Mapped Confidently to Genome | 85.30%
Reads Mapped Confidently to Intergenic Regions | 7.00%
Reads Mapped Confidently to Intronic Regions | 26.40%
Reads Mapped Confidently to Exonic Regions | 51.90%
Reads Mapped Confidently to Transcriptome | 48.20%
Reads Mapped Antisense to Gene | 2.70%
Fraction Reads in Cells | 64.80%
Total Genes Detected | 23824
Median UMI Counts per Cell | 3577
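The per-cell averages in the two reports do not appear to use the same numerator, which is worth keeping in mind when comparing them. A small arithmetic sketch in Python, using only numbers from the two summaries (the variable names are illustrative): the cellranger "Mean Reads per Cell" matches total reads divided by called cells, while the STARsolo value matches "Reads in Cells Mapped to Unique Genes" divided by called cells.

```python
# Values copied from the cellranger and STARsolo summaries in this post.
TOTAL_READS = 934_392_005                  # STARsolo "Number of Reads"
CR_CELLS = 9_449                           # cellranger "Estimated Number of Cells"
SOLO_CELLS = 3_618                         # STARsolo "Estimated Number of Cells"
SOLO_READS_IN_CELLS_UNIQUE = 239_834_914   # "Reads in Cells Mapped to Unique Genes"
SOLO_UMIS_IN_CELLS = 67_707_559            # "UMIs in Cells"

# cellranger: all sequenced reads divided by the number of called cells.
cr_mean_reads = TOTAL_READS // CR_CELLS

# STARsolo: only reads inside called cells that map to unique genes.
solo_mean_reads = SOLO_READS_IN_CELLS_UNIQUE // SOLO_CELLS
solo_mean_umis = SOLO_UMIS_IN_CELLS // SOLO_CELLS

print(cr_mean_reads)    # 98887, the reported cellranger "Mean Reads per Cell"
print(solo_mean_reads)  # 66289, the reported STARsolo "Mean Reads per Cell"
print(solo_mean_umis)   # 18714, the reported STARsolo "Mean UMI per Cell"
```

Because the cell counts in the denominators also differ, the per-cell medians and means of the two runs describe different sets of barcodes and are not directly comparable.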

alexdobin commented 3 years ago

Hi @Anny225225

I think the main difference is the cell filtering (calling). To match CellRanger filtered cells, you would need to use --soloCellFilter EmptyDrops_CR when running STARsolo. Or you can apply this filter to the raw matrices you already generated:

STAR --runMode soloCellFiltering /path/to/count/dir/raw/ /path/to/output/prefix --soloCellFilter EmptyDrops_CR

Cheers Alex
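To see how much of the gap comes from cell calling, one option is to compare the filtered barcode lists from the two tools. Below is a minimal Python sketch, not part of STAR or Cell Ranger. It assumes Cell Ranger's filtered barcodes are in a gzipped `barcodes.tsv.gz` with a `-1` suffix on each barcode and STARsolo's are in a plain `barcodes.tsv` without it. The function names and the file paths in the usage comment are hypothetical.

```python
import gzip
from pathlib import Path


def read_barcodes(path):
    """Read one barcode per line from a plain or gzipped file.

    Any suffix after a dash (for example the '-1' that Cell Ranger appends)
    is dropped so that barcodes from both tools can be compared directly.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return {line.strip().split("-")[0] for line in handle if line.strip()}


def compare_barcodes(cellranger_path, starsolo_path):
    """Count barcodes called by only one tool and by both."""
    cr = read_barcodes(cellranger_path)
    solo = read_barcodes(starsolo_path)
    return {
        "cellranger_only": len(cr - solo),
        "starsolo_only": len(solo - cr),
        "shared": len(cr & solo),
    }


# Usage (hypothetical paths, adjust to your own output directories):
# compare_barcodes("s1/outs/filtered_feature_bc_matrix/barcodes.tsv.gz",
#                  "s1Solo.out/Gene/filtered/barcodes.tsv")
```

If nearly all STARsolo barcodes are shared and the extra Cell Ranger barcodes are low-UMI, the difference is likely in the filtering step rather than in mapping or counting.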

JihedC commented 2 years ago

Hi Alex,

I have the same issue as Anny describes. I am analysing a 5' 10x snRNA-seq dataset. Cell Ranger estimates around 9000 cells for all samples, but if I use STARsolo for the mapping, I only get ~5000 cells.

Here is the script I used for one of the samples:

STAR --genomeDir $INDEX \
    --readFilesIn \
    $READS/WT1_PC_S5_L001_R1_001.fastq.gz,$READS/WT1_PC_S5_L002_R1_001.fastq.gz,$READS/WT1_PC_S5_L003_R1_001.fastq.gz \
    $READS/WT1_PC_S5_L001_R2_001.fastq.gz,$READS/WT1_PC_S5_L002_R2_001.fastq.gz,$READS/WT1_PC_S5_L003_R2_001.fastq.gz \
    --soloBarcodeMate 1   --clip5pNbases 39 0 \
    --soloType CB_UMI_Simple   --soloCBstart 1   --soloCBlen 16   --soloUMIstart 17   --soloUMIlen 10 \
    --soloCellFilter EmptyDrops_CR --soloCBwhitelist $WHITE/737K-august-2016.txt \
    --readFilesCommand zcat

I followed the instructions in your documentation for 5' 10x data and therefore flipped the order of Read1 and Read2 in the script.

I have also added the --soloCellFilter EmptyDrops_CR.

Here is the summary I got:

Number of Reads,31653085
Reads With Valid Barcodes,0.848661
Sequencing Saturation,0.941041
Q30 Bases in CB+UMI,0.946439
Q30 Bases in RNA read,0.921667
Reads Mapped to Genome: Unique+Multiple,0.915819
Reads Mapped to Genome: Unique,0.896513
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.0462742
Reads Mapped to Transcriptome: Unique Genes,0.0460499
Estimated Number of Cells,4890
Reads in Cells Mapped to Unique Genes,766380
Fraction of Reads in Cells,0.525774
Mean Reads per Cell,156
Median Reads per Cell,103
UMIs in Cells,48197
Mean UMI per Cell,9
Median UMI per Cell,8
Mean Genes per Cell,8
Median Genes per Cell,7
Total Genes Detected,7382

I checked the BAM files generated by Cell Ranger and by STARsolo, and they are very similar. That seems to indicate that the difference in cell number is due to the cell filtering.

Can you let me know what you think, and whether you have any ideas for solving the issue?

Thanks in advance,

Jihed

alexdobin commented 2 years ago

Hi Jihed,

I think it's a strand problem; please try --soloStrand Reverse

Best, Alex
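One number in the summary above that is consistent with a strand mismatch is the small share of uniquely mapped reads that get assigned to genes. The Python sketch below computes that ratio from `Summary.csv`-style text; the function names are illustrative and not part of STAR. It is only a rough signal, not a definitive test: a low ratio can have other causes, for example a high share of intronic reads in nuclei data.

```python
import csv
import io


def read_summary(text):
    """Parse 'metric,value' lines (STARsolo Summary.csv style) into a dict."""
    rows = csv.reader(io.StringIO(text))
    return {row[0]: float(row[1]) for row in rows if len(row) == 2}


def transcriptome_to_genome_ratio(summary):
    """Share of uniquely genome-mapped reads that were assigned to unique genes."""
    return (
        summary["Reads Mapped to Transcriptome: Unique Genes"]
        / summary["Reads Mapped to Genome: Unique"]
    )


# Values copied from the two STARsolo summaries in this thread.
SNRNA_5P_RUN = """Reads Mapped to Genome: Unique,0.896513
Reads Mapped to Transcriptome: Unique Genes,0.0460499
"""
FIRST_POST_RUN = """Reads Mapped to Genome: Unique,0.756247
Reads Mapped to Transcriptome: Unique Genes,0.465449
"""

low_ratio = transcriptome_to_genome_ratio(read_summary(SNRNA_5P_RUN))
typical_ratio = transcriptome_to_genome_ratio(read_summary(FIRST_POST_RUN))
print(f"5' snRNA-seq run: {low_ratio:.3f}")
print(f"first post's run: {typical_ratio:.3f}")
```

In the 5' run only about 5% of uniquely mapped reads are counted toward genes, compared with roughly 60% in the run from the first post, which is why checking the strand setting is a reasonable first step.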

JihedC commented 2 years ago

Hi @alexdobin ,

Using --soloStrand Reverse improved the number of cells detected to 6363, but it is still lower than Cell Ranger.

Best,

Jihed