alexdobin / STAR

RNA-seq aligner
MIT License

five prime paired end star solo settings #1366

Open cartographerJ opened 2 years ago

cartographerJ commented 2 years ago

Hi, I am trying to figure out which settings to use so that STARsolo most closely matches the output of Cell Ranger 5.x for 10x 5' paired-end 150 bp reads. I am currently trying the following, but the results in the Gene/filtered/ output are totally off.

STAR \
        --genomeDir ./refdata-gex-GRCh38-STAR/ \
        --readFilesIn $(echo my_files/*R2*fastq.gz | sed 's/ /,/g') $(echo my_files/*R1*fastq.gz | sed 's/ /,/g') \
        --readFilesCommand zcat \
        --soloFeatures Gene SJ GeneFull \
        --runThreadN 16 \
        --soloType CB_UMI_Simple \
        --soloCBwhitelist ./whitelists/v2_737K-august-2016.txt \
        --outFileNamePrefix  "test." \
        --outFilterScoreMin 30 \
        --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
        --soloUMIfiltering MultiGeneUMI_CR \
        --soloUMIdedup 1MM_CR \
        --soloCellFilter  EmptyDrops_CR \
        --soloBarcodeMate 1 \
        --clip5pNbases 39 0 \
        --soloCBstart 1 \
        --soloCBlen 16 \
        --soloUMIstart 17 \
        --soloUMIlen 10 \
        --outSAMtype BAM Unsorted

Also, note that if I include Velocyto in --soloFeatures, I get a segfault.

Thanks! Jeffrey

cartographerJ commented 2 years ago

To follow up: the Gene counts make no sense (this is not a matter of trying to exactly match cellranger). I'm including the Gene summary below.

Number of Reads,268206181
Reads With Valid Barcodes,0.00573598
Sequencing Saturation,0.683844
Q30 Bases in CB+UMI,0.911059
Q30 Bases in RNA read,0.873691
Reads Mapped to Genome: Unique+Multiple,0.857543
Reads Mapped to Genome: Unique,0.743028
Reads Mapped to Gene: Unique+Multipe Gene,0.000280952
Reads Mapped to Gene: Unique Gene,0.000241299
Estimated Number of Cells,19409
Unique Reads in Cells Mapped to Gene,64689
Fraction of Unique Reads in Cells,0.999552
Mean Reads per Cell,3
Median Reads per Cell,2
UMIs in Cells,20461
Mean UMI per Cell,1
Median UMI per Cell,1
Mean Gene per Cell,1
Median Gene per Cell,1
Total Gene Detected,7104

as compared to cellranger output: 5'pe

I have also tested both v2 and v3 3' data and get reasonable results using some variation of the following, depending on the chemistry (suggesting the reference is fine).

STAR \
        --genomeDir ./refdata-gex-GRCh38-STAR/ \
        --readFilesIn $(echo ./5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/*R2*fastq.gz | sed 's/ /,/g') $(echo ./5k_pbmc_protein_v3_fastqs/5k_pbmc_protein_v3_gex_fastqs/*R1*fastq.gz | sed 's/ /,/g') \
        --readFilesCommand zcat \
        --soloFeatures Gene SJ GeneFull \
        --runThreadN 16 \
        --soloType CB_UMI_Simple \
        --soloCBwhitelist ./whitelists/v3_3M-february-2018.txt \
        --outFileNamePrefix "test/" \
        --outFilterScoreMin 30 \
        --clipAdapterType CellRanger4 \
        --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
        --soloUMIfiltering MultiGeneUMI_CR \
        --soloUMIdedup 1MM_CR \
        --soloCellFilter EmptyDrops_CR \
        --soloCBstart 1 \
        --soloCBlen 16 \
        --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \
        --soloUMIstart 17 \
        --soloUMIlen 12 \
        --outSAMtype BAM Unsorted

with Gene summaries that make sense for v3 and v2, respectively:

Number of Reads,151731342
Reads With Valid Barcodes,0.976629
Sequencing Saturation,0.522822
Q30 Bases in CB+UMI,0.956087
Q30 Bases in RNA read,0.919195
Reads Mapped to Genome: Unique+Multiple,0.956088
Reads Mapped to Genome: Unique,0.851986
Reads Mapped to Gene: Unique+Multipe Gene,0.555485
Reads Mapped to Gene: Unique Gene,0.548225
Estimated Number of Cells,5246
Unique Reads in Cells Mapped to Gene,72738816
Fraction of Unique Reads in Cells,0.874444
Mean Reads per Cell,13865
Median Reads per Cell,12534
UMIs in Cells,33685689
Mean UMI per Cell,6421
Median UMI per Cell,5742
Mean Gene per Cell,1849
Median Gene per Cell,1724
Total Gene Detected,22867

Number of Reads,379462522
Reads With Valid Barcodes,0.978749
Sequencing Saturation,0.910022
Q30 Bases in CB+UMI,0.975574
Q30 Bases in RNA read,0.887535
Reads Mapped to Genome: Unique+Multiple,0.96368
Reads Mapped to Genome: Unique,0.88739
Reads Mapped to Gene: Unique+Multipe Gene,0.645356
Reads Mapped to Gene: Unique Gene,0.639177
Estimated Number of Cells,4589
Unique Reads in Cells Mapped to Gene,226875830
Fraction of Unique Reads in Cells,0.935402
Mean Reads per Cell,49439
Median Reads per Cell,44894
UMIs in Cells,18762297
Mean UMI per Cell,4088
Median UMI per Cell,3711
Mean Gene per Cell,1261
Median Gene per Cell,1207
Total Gene Detected,21161

Happy to provide more info as needed, thanks!

cartographerJ commented 2 years ago

I also tried adding the parameter --soloStrand Reverse, based on a prior post, with similar results: it just seems not to be counting things correctly.

Number of Reads,268206181
Reads With Valid Barcodes,0.0049698
Sequencing Saturation,0.89122
Q30 Bases in CB+UMI,0.911059
Q30 Bases in RNA read,0.873691
Reads Mapped to Genome: Unique+Multiple,0.857543
Reads Mapped to Genome: Unique,0.743028
Reads Mapped to Gene: Unique+Multipe Gene,0.00239027
Reads Mapped to Gene: Unique Gene,0.00234211
Estimated Number of Cells,59324
Unique Reads in Cells Mapped to Gene,628019
Fraction of Unique Reads in Cells,0.999761
Mean Reads per Cell,10
Median Reads per Cell,2
UMIs in Cells,68332
Mean UMI per Cell,1
Median UMI per Cell,1
Mean Gene per Cell,1
Median Gene per Cell,1
Total Gene Detected,12246
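
A "Reads With Valid Barcodes" fraction below 1% usually means the barcode is being read from the wrong mate or the wrong whitelist is in use. One rough way to check is to measure how many reads in each FASTQ start with a whitelisted 16-base sequence. The sketch below is not part of STAR; `wl_hit_rate` is a hypothetical helper name, and it assumes an uncompressed four-line-per-record FASTQ (or `-` for stdin) and a whitelist with one barcode per line.

```shell
# Hypothetical helper (not a STAR tool): fraction of reads whose first
# CB_LEN bases appear in a barcode whitelist.
# Usage: wl_hit_rate WHITELIST FASTQ_OR_DASH [CB_LEN]
wl_hit_rate() {
    wl_file="$1"; fq_file="$2"; cb_len="${3:-16}"
    awk -v len="$cb_len" '
        NR == FNR { wl[$1] = 1; next }      # first input: load the whitelist
        FNR % 4 == 2 {                      # second input: FASTQ sequence lines
            total++
            if (substr($0, 1, len) in wl) hits++
        }
        END {
            if (total) printf "%.4f\n", hits / total
            else print "NA"
        }
    ' "$wl_file" "$fq_file"
}

# Example on a subsample of each mate (file names are placeholders):
#   zcat sample_R1.fastq.gz | head -n 400000 | wl_hit_rate whitelist.txt -
#   zcat sample_R2.fastq.gz | head -n 400000 | wl_hit_rate whitelist.txt -
```

The mate with the high hit rate is the one carrying the cell barcode. If neither mate scores well, the whitelist itself is the more likely problem.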

alexdobin commented 2 years ago

Hi Jeffrey,

I think the 10X 5' protocol has the barcode on read1. So if you use --soloBarcodeMate 1, you need to supply the read1 and read2 files in the normal order: --readFilesIn read1.fq read2.fq.

Cheers Alex
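
Applied to the command from the original post, this suggestion amounts to swapping the two file lists so that R1 comes first. The sketch below only prints the resulting command for review; it does not run STAR. The helper names `join_by_comma` and `print_star_5p_cmd` are hypothetical, and every option and path is copied from the original post rather than verified against Cell Ranger.

```shell
# Join all arguments with commas (STAR takes comma-separated file lists).
join_by_comma() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Print (do not run) the 5' paired-end command with read1 listed first,
# as required when --soloBarcodeMate 1 is used.
# Usage: print_star_5p_cmd R1_LIST R2_LIST
print_star_5p_cmd() {
    r1_list="$1"; r2_list="$2"
    cat <<EOF
STAR \\
  --genomeDir ./refdata-gex-GRCh38-STAR/ \\
  --readFilesIn $r1_list $r2_list \\
  --readFilesCommand zcat \\
  --soloFeatures Gene SJ GeneFull \\
  --runThreadN 16 \\
  --soloType CB_UMI_Simple \\
  --soloCBwhitelist ./whitelists/v2_737K-august-2016.txt \\
  --outFileNamePrefix test. \\
  --outFilterScoreMin 30 \\
  --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \\
  --soloUMIfiltering MultiGeneUMI_CR \\
  --soloUMIdedup 1MM_CR \\
  --soloCellFilter EmptyDrops_CR \\
  --soloBarcodeMate 1 \\
  --clip5pNbases 39 0 \\
  --soloCBstart 1 --soloCBlen 16 \\
  --soloUMIstart 17 --soloUMIlen 10 \\
  --outSAMtype BAM Unsorted
EOF
}

# Example (paths from the original post):
#   print_star_5p_cmd "$(join_by_comma my_files/*R1*fastq.gz)" \
#                     "$(join_by_comma my_files/*R2*fastq.gz)"
```

Printing the command first makes it easy to confirm that the R1 list really precedes the R2 list before committing to a long run.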

cartographerJ commented 2 years ago

Hi Alex, thanks for the response! Will test and get back here to close this.

JihedC commented 2 years ago

Hi @cartographerJ ,

Did the suggestion help to fix the problem?

cartographerJ commented 2 years ago

Things definitely ran, but the cell numbers and counts were pretty different from cellranger 5.0.0, unlike with the 3' kits, where you get almost perfect concordance between cellranger and STARsolo.

JihedC commented 2 years ago

Hi @cartographerJ ,

That's exactly the same for me. We used CellRanger 6.0.0 and measured ~9000 cells per sample, while with STARsolo I get only 4000 cells per sample. I have tried adding different arguments following the issues I could find on GitHub, but that did not improve the output. If you ever find a fix, please let me know :)

camelest commented 2 years ago

Hi @cartographerJ and @JihedC, I'm just wondering whether you found a solution. I also get only half the number of cells when processing with STARsolo, and the number of reads per cell was significantly lower.

cartographerJ commented 2 years ago

I have not figured this out, but would love for it to be solved, @alexdobin, so that I can use STARsolo instead of Cell Ranger.

camelest commented 2 years ago

@cartographerJ @alexdobin I see, thank you so much for your response. I'm pretty sure that the BAM file itself looks similar. When I deduplicate the BAM files based on CR and UR, the result also looks fine. I guess it is just a cell barcode filtering problem. I would also love to use STARsolo instead of Cell Ranger for 5' data.
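
A check along these lines can be sketched by counting distinct raw barcodes and distinct barcode/UMI pairs from the CR and UR tags. This is only a rough approximation: it ignores genes and the 1MM_CR collapsing, so it is not STARsolo's UMI deduplication. `count_cb_umi` is a hypothetical helper name, and the BAM must have been written with CR and UR included in --outSAMattributes.

```shell
# Hypothetical helper: count distinct raw cell barcodes (CR) and distinct
# CR+UR pairs from SAM text on stdin. Header lines starting with "@" are
# skipped; optional tags are searched from field 12 onward.
count_cb_umi() {
    awk -F '\t' '
        !/^@/ {
            cb = ""; umi = ""
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^CR:Z:/) cb  = substr($i, 6)
                if ($i ~ /^UR:Z:/) umi = substr($i, 6)
            }
            if (cb != "" && umi != "") {
                if (!(cb in cbs)) { cbs[cb] = 1; n_cb++ }
                key = cb SUBSEP umi
                if (!(key in pairs)) { pairs[key] = 1; n_pair++ }
            }
        }
        END { printf "barcodes\t%d\ncb_umi_pairs\t%d\n", n_cb, n_pair }
    '
}

# Example (BAM file names are placeholders):
#   samtools view starsolo.bam   | count_cb_umi
#   samtools view cellranger.bam | count_cb_umi
```

If the two BAM files give similar numbers here while the filtered matrices differ, that points to cell calling rather than alignment or barcode extraction.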

JihedC commented 2 years ago

Same here, I didn't get it to work. Following Alex's advice, I got more cells but was still missing 30% of the total.
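
To see which cells are missing rather than just how many, the two filtered barcode lists can be compared directly. The sketch below assumes one barcode per line in each file and strips the numeric suffix that Cell Ranger appends (for example "-1"). `compare_barcodes` is a hypothetical helper name, and the paths in the example are the usual output locations, not paths confirmed in this thread.

```shell
# Hypothetical helper: compare two barcode lists (one barcode per line).
# A trailing "-<digits>" suffix, as written by Cell Ranger, is stripped
# from both inputs before comparison.
# Usage: compare_barcodes FIRST_LIST SECOND_LIST
compare_barcodes() {
    a_sorted=$(mktemp); b_sorted=$(mktemp)
    sed 's/-[0-9]*$//' "$1" | LC_ALL=C sort -u > "$a_sorted"
    sed 's/-[0-9]*$//' "$2" | LC_ALL=C sort -u > "$b_sorted"
    printf 'only_first\t%s\n'  "$(LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')"
    printf 'only_second\t%s\n' "$(LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')"
    printf 'shared\t%s\n'      "$(LC_ALL=C comm -12 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')"
    rm -f "$a_sorted" "$b_sorted"
}

# Example (assumed default output locations):
#   zcat outs/filtered_feature_bc_matrix/barcodes.tsv.gz > cellranger_barcodes.txt
#   compare_barcodes test.Solo.out/Gene/filtered/barcodes.tsv cellranger_barcodes.txt
```

If nearly all STARsolo cells are also Cell Ranger cells, the gap is probably a cell-calling threshold difference and not a barcode-detection problem.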