alexdobin / STAR

RNA-seq aligner
MIT License

best practice for alignment to available, though distant, reference genome(s) #850

Open cb4github opened 4 years ago

cb4github commented 4 years ago

Dear Alex,

In brief (the Log.final.out files for both reference genomes are pasted below): when mapping ~200M shark reads (8 samples from 2 species) to two arguably distant reference genomes, coelacanth and elephant shark, 99.83% and 99.27% of reads, respectively, were left unmapped as "too short".

Can you suggest best practices for choosing mapping parameters that increase the number of mapped reads while still keeping spurious matches (overmatching) under control?

Please let me know if you need more info, thanks.

Best, CB

find .. -name "*final.out" -print -exec cat {} \;
../coelacanth/Reads_to_coelacanth.Log.final.out
                                 Started job on |       Feb 29 22:16:13
                             Started mapping on |       Feb 29 22:19:36
                                    Finished on |       Mar 01 00:06:08
       Mapping speed, Million of reads per hour |       113.20

                          Number of input reads |       200994741
                      Average input read length |       300
                                    UNIQUE READS:
                   Uniquely mapped reads number |       305298
                        Uniquely mapped reads % |       0.15%
                          Average mapped length |       250.86
                       Number of splices: Total |       42220
            Number of splices: Annotated (sjdb) |       30772
                       Number of splices: GT/AG |       41003
                       Number of splices: GC/AG |       3
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       1214
                      Mismatch rate per base, % |       3.62%
                         Deletion rate per base |       0.06%
                        Deletion average length |       1.19
                        Insertion rate per base |       0.16%
                       Insertion average length |       2.49
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       29177
             % of reads mapped to multiple loci |       0.01%
        Number of reads mapped to too many loci |       394
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       99.83%
                     % of reads unmapped: other |       0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
../elephant_shark/Reads_to_elephant_shark.Log.final.out
                                 Started job on |       Feb 12 16:59:18
                             Started mapping on |       Feb 12 17:02:08
                                    Finished on |       Feb 12 20:10:35
       Mapping speed, Million of reads per hour |       63.99

                          Number of input reads |       200994741
                      Average input read length |       300
                                    UNIQUE READS:
                   Uniquely mapped reads number |       973915
                        Uniquely mapped reads % |       0.48%
                          Average mapped length |       258.59
                       Number of splices: Total |       389197
            Number of splices: Annotated (sjdb) |       342236
                       Number of splices: GT/AG |       379757
                       Number of splices: GC/AG |       4635
                       Number of splices: AT/AC |       127
               Number of splices: Non-canonical |       4678
                      Mismatch rate per base, % |       3.36%
                         Deletion rate per base |       0.40%
                        Deletion average length |       1.96
                        Insertion rate per base |       0.30%
                       Insertion average length |       2.06
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       465809
             % of reads mapped to multiple loci |       0.23%
        Number of reads mapped to too many loci |       843
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       99.27%
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
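
For a quick cross-check of the figures above, the three "% of reads unmapped" rows of a Log.final.out can be summed with a small awk helper. This is only a sketch: the function name `unmapped_total` is made up for illustration, and it assumes the `label | value` layout shown in the logs above.

```shell
# Hypothetical helper: sum the "% of reads unmapped" rows of one STAR Log.final.out.
# Assumes each row has the form "<label> | <value>%", as in the logs above.
unmapped_total() {
  awk -F'|' '
    /% of reads unmapped/ {
      gsub(/[ \t%]/, "", $2)   # strip padding and the percent sign from the value
      total += $2
    }
    END { printf "%.2f\n", total }
  ' "$1"
}

# Example usage (path taken from the find output above):
# unmapped_total ../elephant_shark/Reads_to_elephant_shark.Log.final.out
```

Summing all three rows rather than reading only the "too short" row also picks up the small "other" category, which is non-zero in the elephant shark log.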
alexdobin commented 4 years ago

Hi @cb4github

STAR will not work well for mapping reads to the genomes of distant species. The maximum divergence it can deal with is ~3-5%. You can try reducing --seedSearchStartLmax to 10 and increasing --winAnchorMultimapNmax to 200 (or 500), but I doubt this will improve the results sufficiently.

Cheers Alex
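
As a rough illustration of the suggestion above, the two relaxed parameters could be added to an otherwise ordinary STAR run as sketched below. Only `--seedSearchStartLmax 10` and `--winAnchorMultimapNmax 200` come from the reply. The function name, thread count, output prefix, index directory, and read file names are placeholders, not values from this thread. The function only prints the command so it can be inspected before running.

```shell
# Sketch only: prints a STAR command line with the two relaxed parameters.
# $1 = genome index directory (placeholder); remaining args = read file(s).
relaxed_star_cmd() {
  genome_dir=$1
  shift
  echo STAR \
    --runThreadN 8 \
    --genomeDir "$genome_dir" \
    --readFilesIn "$@" \
    --outFileNamePrefix relaxed_ \
    --seedSearchStartLmax 10 \
    --winAnchorMultimapNmax 200
}

# Inspect the command first; pipe to sh to actually run it.
# File names here are hypothetical.
relaxed_star_cmd /path/to/star_index sample_R1.fastq sample_R2.fastq
```

To try the more permissive setting mentioned in the reply, change 200 to 500. Comparing the resulting Log.final.out against the original run shows whether the "too short" fraction moved at all.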