alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

STAR parameters for Drop-seq data, and application of STARsolo #1018

Open Kaj-cyber opened 4 years ago

Kaj-cyber commented 4 years ago

Dear Alex,

I would like to ask for your kind recommendation. I'm working with canine Drop-seq data by which each transcript read has 55 bp. The length of barcode+UMI is 20 bp. The example data was included as follows (https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010756/SRR11014432) According to Drop-seq cookbook which recommened default setting of STAR during alignment, it renders alignment percentage around 50%

Would you please suggest me how to adjust parameters to obtain the better result? (I've never worked with aligment with barcode before and totally have no idea where should I start)

With all due respects, I'm also wondering if it is possible to apply STARsolo with the Drop-seq data? Due to my lack of knowledge and limited experience, I try so hard to find an example for such application without success.

Your kind suggestion would be sincerely appreciated.

Best regards, Kaj

alexdobin commented 4 years ago

Hi Kaj,

please post the entire Log.final.out file. The mappability of 50% is not great - it depends on sequencing quality as well as genome assembly. It's a good idea to assess the "baseline" mappability using a bulk RNA-seq dataset from a similar tissue, and trimming its read length to 55b.

For Drop-seq, you need --soloType Droplet with proper CB and UMI starts/lengths with: --soloCBstart --soloCBlen --soloUMIstart --soloUMIlen and no-whitelist operation with --soloCBwhitelist None

Cheers Alex

Kaj-cyber commented 4 years ago

@alexdobin Dear Alex, I used to perform alignment with dog genome (transcript length=80-90 bp) and acquired 70-80%. After carefully reading your answers in several issues, I managed to acquired >70% alignment by modifying 3 parameters as follows: --outFilterScoreMinOverLread 0.30 \ --outFilterMismatchNoverLmax 0.1 \ --outFilterMultimapNmax 25 \

And the log.out result was as follows: ` Started job on | Sep 03 09:23:53 Started mapping on | Sep 03 09:24:06 Finished on | Sep 03 09:30:49 Mapping speed, Million of reads per hour | 1538.19

                      Number of input reads |   172191416
                  Average input read length |   50
                                UNIQUE READS:
               Uniquely mapped reads number |   127866532
                    Uniquely mapped reads % |   74.26%
                      Average mapped length |   51.94
                   Number of splices: Total |   5033819
        Number of splices: Annotated (sjdb) |   4193485
                   Number of splices: GT/AG |   4523960
                   Number of splices: GC/AG |   53191
                   Number of splices: AT/AC |   7629
           Number of splices: Non-canonical |   449039
                  Mismatch rate per base, % |   0.76%
                     Deletion rate per base |   0.03%
                    Deletion average length |   1.34
                    Insertion rate per base |   0.02%
                   Insertion average length |   1.28
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   18104863
         % of reads mapped to multiple loci |   10.51%
    Number of reads mapped to too many loci |   839570
         % of reads mapped to too many loci |   0.49%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 119710 % of reads unmapped: too many mismatches | 0.07% Number of reads unmapped: too short | 17741427 % of reads unmapped: too short | 10.30% Number of reads unmapped: other | 7519314 % of reads unmapped: other | 4.37% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%`

From FastQC result of the transcript read, I noticed poor Per base sequences of the first 5 bp of the transcript. I don't know exactly how to remove them using Drop-seq tool. I somehow highly suspected them as the source of poor alignment using default STAR parameters.

However, STARsolo rendered me a better result without any parameter modification!! It also allowed me to cut/trim the transcript reads with my normally used tools (flexbar, cutadapt). From your recommended setting, I got the alignment result as follows: ` Started job on | Sep 03 17:45:57 Started mapping on | Sep 03 17:46:12 Finished on | Sep 03 17:53:32 Mapping speed, Million of reads per hour | 1266.81

                      Number of input reads |   154831743
                  Average input read length |   47
                                UNIQUE READS:
               Uniquely mapped reads number |   119135101
                    Uniquely mapped reads % |   76.94%
                      Average mapped length |   47.39
                   Number of splices: Total |   4187119
        Number of splices: Annotated (sjdb) |   3550144
                   Number of splices: GT/AG |   3801461
                   Number of splices: GC/AG |   41066
                   Number of splices: AT/AC |   6309
           Number of splices: Non-canonical |   338283
                  Mismatch rate per base, % |   0.80%
                     Deletion rate per base |   0.03%
                    Deletion average length |   1.30
                    Insertion rate per base |   0.02%
                   Insertion average length |   1.21
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   13894435
         % of reads mapped to multiple loci |   8.97%
    Number of reads mapped to too many loci |   1576860
         % of reads mapped to too many loci |   1.02%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 17843723 % of reads unmapped: too short | 11.52% Number of reads unmapped: other | 2381624 % of reads unmapped: other | 1.54% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%`

I do prefer STARsolo as the aligner! After the test running, I tried to extract the expression matrix of all cells in a file using --quantMode GeneCounts option. However, it gave me like sum count overall values of the sam file.

#### Would you please guide me a bit further how to get such expression matrix (it's my last request for this issue, please feel free to close it after that)?

Best regards, Kaj

alexdobin commented 4 years ago

Hi Kaj,

--outFilterScoreMinOverLread 0.30 is somewhat dangerous, as you will be letting in some poor-quality (short) alignments. I would suggest trimming the 5 poor-quality bases with --clip5pNbases 5, while keeping --outFilterScoreMinOverLread as large as possible, ideally at its default 0.66

To get the expression matrix, you need to use --soloType Droplet --soloCBstart <> --soloCBlen <> --soloUMIstart <> --soloUMIlen <> --soloCBwhitelist None with with proper CB and UMI starts/lengths.

Cheers Alex