alexdobin / STAR

RNA-seq aligner
MIT License
std::bad_alloc #274

Closed: tommycarstensen closed this issue 6 years ago

tommycarstensen commented 7 years ago

Some of my STAR jobs fail with a std::bad_alloc. Has anyone experienced this before? How did they solve it? Here is the error message I get:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
mapping.sh: line 53:  9618 Aborted

This is the tail of Log.out:

Loaded database junctions from file: out_STAR/38.83/pass1/NA21826/SJ.out.tab, total number of junctions: 188396707 junctions

Jun 05 15:23:26   Loaded database junctions from the sjdbFileChrStartEnd file(s), 188396707 total junctions

WARNING: long repeat for junction # 115383 : 15 88855522 88855578; left shift = 14; right shift = 255
WARNING: long repeat for junction # 157707 : 18 46969610 47029116; left shift = 183; right shift = 255
WARNING: long repeat for junction # 42599589 : 15 88855522 88855578; left shift = 14; right shift = 255
Jun 05 15:26:21   Finished preparing junctions
Jun 05 15:26:21 ..... inserting junctions into the genome indices

Here is my command line:

$star \
 --runMode alignReads \
 --runThreadN $runThreadN \
 --genomeDir genome_index_$affix \
 --readFilesIn $files1 $file2 \
 --readFilesCommand zcat \
 --outFileNamePrefix $outFileNamePrefix \
 --outSAMtype BAM SortedByCoordinate \
 --quantMode GeneCounts \
 --limitSjdbInsertNsj 4000000 \
 --sjdbFileChrStartEnd out_STAR/$affix/pass1/*/SJ.out.tab

Here are further details from another log file:

STAR version=STAR_2.5.2b
STAR compilation time,server,dir=Fri Jan 20 16:41:38 GMT 2017 :/nfs/gapi/data/src/c/star_2.5.2b/STAR-2.5.2b/source
##### DEFAULT parameters:
versionSTAR                       20201
versionGenome                     20101   20200   
parametersFiles                   -   
sysShell                          -
runMode                           alignReads
runThreadN                        1
runDirPerm                        User_RWX
runRNGseed                        777
genomeDir                         ./GenomeDir/
genomeLoad                        NoSharedMemory
genomeFastaFiles                  -   
genomeSAindexNbases               14
genomeChrBinNbits                 18
genomeSAsparseD                   1
genomeSuffixLengthMax             18446744073709551615
readFilesIn                       Read1   Read2   
readFilesCommand                  -   
readMatesLengthsIn                NotEqual
readMapNumber                     18446744073709551615
readNameSeparator                 /   
inputBAMfile                      -
bamRemoveDuplicatesType           -
bamRemoveDuplicatesMate2basesN    0
limitGenomeGenerateRAM            31000000000
limitIObufferSize                 150000000
limitOutSAMoneReadBytes           100000
limitOutSJcollapsed               1000000
limitOutSJoneRead                 1000
limitBAMsortRAM                   0
limitSjdbInsertNsj                1000000
outTmpDir                         -
outTmpKeep                        None
outStd                            Log
outReadsUnmapped                  None
outQSconversionAdd                0
outMultimapperOrder               Old_2.4
outSAMtype                        SAM   
outSAMmode                        Full
outSAMstrandField                 None
outSAMattributes                  Standard   
outSAMunmapped                    None   
outSAMorder                       Paired
outSAMprimaryFlag                 OneBestScore
outSAMreadID                      Standard
outSAMmapqUnique                  255
outSAMflagOR                      0
outSAMflagAND                     65535
outSAMattrRGline                  -   
outSAMheaderHD                    -   
outSAMheaderPG                    -   
outSAMheaderCommentFile           -
outBAMcompression                 1
outBAMsortingThreadN              0
outSAMfilter                      None   
outSAMmultNmax                    18446744073709551615
outSAMattrIHstart                 1
outSJfilterReads                  All
outSJfilterCountUniqueMin         3   1   1   1   
outSJfilterCountTotalMin          3   1   1   1   
outSJfilterOverhangMin            30   12   12   12   
outSJfilterDistToOtherSJmin       10   0   5   10   
outSJfilterIntronMaxVsReadN       50000   100000   200000   
outWigType                        None   
outWigStrand                      Stranded   
outWigReferencesPrefix            -
outWigNorm                        RPM   
outFilterType                     Normal
outFilterMultimapNmax             10
outFilterMultimapScoreRange       1
outFilterScoreMin                 0
outFilterScoreMinOverLread        0.66
outFilterMatchNmin                0
outFilterMatchNminOverLread       0.66
outFilterMismatchNmax             10
outFilterMismatchNoverLmax        0.3
outFilterMismatchNoverReadLmax    1
outFilterIntronMotifs             None
clip5pNbases                      0   
clip3pNbases                      0   
clip3pAfterAdapterNbases          0   
clip3pAdapterSeq                  -   
clip3pAdapterMMp                  0.1   
winBinNbits                       16
winAnchorDistNbins                9
winFlankNbins                     4
winAnchorMultimapNmax             50
winReadCoverageRelativeMin        0.5
winReadCoverageBasesMin           0
scoreGap                          0
scoreGapNoncan                    -8
scoreGapGCAG                      -4
scoreGapATAC                      -8
scoreStitchSJshift                1
scoreGenomicLengthLog2scale       -0.25
scoreDelBase                      -2
scoreDelOpen                      -2
scoreInsOpen                      -2
scoreInsBase                      -2
seedSearchLmax                    0
seedSearchStartLmax               50
seedSearchStartLmaxOverLread      1
seedPerReadNmax                   1000
seedPerWindowNmax                 50
seedNoneLociPerWindow             10
seedMultimapNmax                  10000
alignIntronMin                    21
alignIntronMax                    0
alignMatesGapMax                  0
alignTranscriptsPerReadNmax       10000
alignSJoverhangMin                5
alignSJDBoverhangMin              3
alignSJstitchMismatchNmax         0   -1   0   0   
alignSplicedMateMapLmin           0
alignSplicedMateMapLminOverLmate    0.66
alignWindowsPerReadNmax           10000
alignTranscriptsPerWindowNmax     100
alignEndsType                     Local
alignSoftClipAtReferenceEnds      Yes
alignEndsProtrude                 0   ConcordantPair   
chimSegmentMin                    0
chimScoreMin                      0
chimScoreDropMax                  20
chimScoreSeparation               10
chimScoreJunctionNonGTAG          -1
chimJunctionOverhangMin           20
chimOutType                       SeparateSAMold
chimFilter                        banGenomicN   
chimSegmentReadGapMax             0
sjdbFileChrStartEnd               -   
sjdbGTFfile                       -
sjdbGTFchrPrefix                  -
sjdbGTFfeatureExon                exon
sjdbGTFtagExonParentTranscript    transcript_id
sjdbGTFtagExonParentGene          gene_id
sjdbOverhang                      100
sjdbScore                         2
sjdbInsertSave                    Basic
quantMode                         -   
quantTranscriptomeBAMcompression    1
quantTranscriptomeBan             IndelSoftclipSingleend
twopass1readsN                    18446744073709551615
twopassMode                       None
##### Command Line:
##### Initial USER parameters from Command Line:
###### All USER parameters from Command Line:
runMode                       alignReads     ~RE-DEFINED
runThreadN                    16     ~RE-DEFINED
genomeDir                     genome_index_38.83     ~RE-DEFINED
readFilesCommand              zcat        ~RE-DEFINED
outSAMtype                    BAM   SortedByCoordinate        ~RE-DEFINED
quantMode                     GeneCounts        ~RE-DEFINED
limitSjdbInsertNsj            4000000     ~RE-DEFINED
##### Finished reading parameters from all sources

##### Final user re-defined parameters-----------------:
runMode                           alignReads
runThreadN                        16
genomeDir                         genome_index_38.83
readFilesCommand                  zcat   
limitSjdbInsertNsj                4000000
outSAMtype                        BAM   SortedByCoordinate   
quantMode                         GeneCounts   

-------------------------------
##### Final effective command line:

##### Final parameters after user input--------------------------------:
versionSTAR                       20201
versionGenome                     20101   20200   
parametersFiles                   -   
sysShell                          -
runMode                           alignReads
runThreadN                        16
runDirPerm                        User_RWX
runRNGseed                        777
genomeDir                         genome_index_38.83
genomeLoad                        NoSharedMemory
genomeFastaFiles                  -   
genomeSAindexNbases               14
genomeChrBinNbits                 18
genomeSAsparseD                   1
genomeSuffixLengthMax             18446744073709551615
readFilesCommand                  zcat   
readMatesLengthsIn                NotEqual
readMapNumber                     18446744073709551615
readNameSeparator                 /   
inputBAMfile                      -
bamRemoveDuplicatesType           -
bamRemoveDuplicatesMate2basesN    0
limitGenomeGenerateRAM            31000000000
limitIObufferSize                 150000000
limitOutSAMoneReadBytes           100000
limitOutSJcollapsed               1000000
limitOutSJoneRead                 1000
limitBAMsortRAM                   0
limitSjdbInsertNsj                4000000
outTmpDir                         -
outTmpKeep                        None
outStd                            Log
outReadsUnmapped                  None
outQSconversionAdd                0
outMultimapperOrder               Old_2.4
outSAMtype                        BAM   SortedByCoordinate   
outSAMmode                        Full
outSAMstrandField                 None
outSAMattributes                  Standard   
outSAMunmapped                    None   
outSAMorder                       Paired
outSAMprimaryFlag                 OneBestScore
outSAMreadID                      Standard
outSAMmapqUnique                  255
outSAMflagOR                      0
outSAMflagAND                     65535
outSAMattrRGline                  -   
outSAMheaderHD                    -   
outSAMheaderPG                    -   
outSAMheaderCommentFile           -
outBAMcompression                 1
outBAMsortingThreadN              0
outSAMfilter                      None   
outSAMmultNmax                    18446744073709551615
outSAMattrIHstart                 1
outSJfilterReads                  All
outSJfilterCountUniqueMin         3   1   1   1   
outSJfilterCountTotalMin          3   1   1   1   
outSJfilterOverhangMin            30   12   12   12   
outSJfilterDistToOtherSJmin       10   0   5   10   
outSJfilterIntronMaxVsReadN       50000   100000   200000   
outWigType                        None   
outWigStrand                      Stranded   
outWigReferencesPrefix            -
outWigNorm                        RPM   
outFilterType                     Normal
outFilterMultimapNmax             10
outFilterMultimapScoreRange       1
outFilterScoreMin                 0
outFilterScoreMinOverLread        0.66
outFilterMatchNmin                0
outFilterMatchNminOverLread       0.66
outFilterMismatchNmax             10
outFilterMismatchNoverLmax        0.3
outFilterMismatchNoverReadLmax    1
outFilterIntronMotifs             None
clip5pNbases                      0   
clip3pNbases                      0   
clip3pAfterAdapterNbases          0   
clip3pAdapterSeq                  -   
clip3pAdapterMMp                  0.1   
winBinNbits                       16
winAnchorDistNbins                9
winFlankNbins                     4
winAnchorMultimapNmax             50
winReadCoverageRelativeMin        0.5
winReadCoverageBasesMin           0
scoreGap                          0
scoreGapNoncan                    -8
scoreGapGCAG                      -4
scoreGapATAC                      -8
scoreStitchSJshift                1
scoreGenomicLengthLog2scale       -0.25
scoreDelBase                      -2
scoreDelOpen                      -2
scoreInsOpen                      -2
scoreInsBase                      -2
seedSearchLmax                    0
seedSearchStartLmax               50
seedSearchStartLmaxOverLread      1
seedPerReadNmax                   1000
seedPerWindowNmax                 50
seedNoneLociPerWindow             10
seedMultimapNmax                  10000
alignIntronMin                    21
alignIntronMax                    0
alignMatesGapMax                  0
alignTranscriptsPerReadNmax       10000
alignSJoverhangMin                5
alignSJDBoverhangMin              3
alignSJstitchMismatchNmax         0   -1   0   0   
alignSplicedMateMapLmin           0
alignSplicedMateMapLminOverLmate    0.66
alignWindowsPerReadNmax           10000
alignTranscriptsPerWindowNmax     100
alignEndsType                     Local
alignSoftClipAtReferenceEnds      Yes
alignEndsProtrude                 0   ConcordantPair   
chimSegmentMin                    0
chimScoreMin                      0
chimScoreDropMax                  20
chimScoreSeparation               10
chimScoreJunctionNonGTAG          -1
chimJunctionOverhangMin           20
chimOutType                       SeparateSAMold
chimFilter                        banGenomicN   
chimSegmentReadGapMax             0
sjdbGTFfile                       -
sjdbGTFchrPrefix                  -
sjdbGTFfeatureExon                exon
sjdbGTFtagExonParentTranscript    transcript_id
sjdbGTFtagExonParentGene          gene_id
sjdbOverhang                      100
sjdbScore                         2
sjdbInsertSave                    Basic
quantMode                         GeneCounts   
quantTranscriptomeBAMcompression    1
quantTranscriptomeBan             IndelSoftclipSingleend
twopass1readsN                    18446744073709551615
twopassMode                       None
----------------------------------------

   readsCommandsFile:
echo FILE 0

WARNING: --limitBAMsortRAM=0, will use genome size as RAM limit for BAM sorting
Finished loading and checking parameters
Reading genome generation parameters:
versionGenome                 20201        ~RE-DEFINED
genomeFastaFiles              Homo_sapiens.GRCh38.dna.primary_assembly.fa        ~RE-DEFINED
genomeSAindexNbases           14     ~RE-DEFINED
genomeChrBinNbits             18     ~RE-DEFINED
genomeSAsparseD               1     ~RE-DEFINED
sjdbOverhang                  74     ~RE-DEFINED
sjdbFileChrStartEnd           -        ~RE-DEFINED
sjdbGTFfile                   Homo_sapiens.GRCh38.83.gtf     ~RE-DEFINED
sjdbGTFchrPrefix              -     ~RE-DEFINED
sjdbGTFfeatureExon            exon     ~RE-DEFINED
sjdbGTFtagExonParentTranscripttranscript_id     ~RE-DEFINED
sjdbGTFtagExonParentGene      gene_id     ~RE-DEFINED
sjdbInsertSave                Basic     ~RE-DEFINED
Genome version is compatible with current STAR version
Number of real (reference) chromosomes= 194
1   1   248956422   0
2   10  133797422   249036800
3   11  135086622   382992384
4   12  133275309   518258688
5   13  114364328   651689984
6   14  107043718   766246912
7   15  101991189   873463808
8   16  90338345    975699968
9   17  83257441    1066139648
10  18  80373285    1149501440
11  19  58617616    1229979648
12  2   242193529   1288699904
13  20  64444167    1530920960
14  21  46709983    1595408384
15  22  50818468    1642332160
16  3   198295559   1693188096
17  4   190214555   1891631104
18  5   181538259   2081947648
19  6   170805979   2263613440
20  7   159345973   2434531328
21  8   145138636   2593914880
22  9   138394717   2739142656
23  MT  16569   2877554688
24  X   156040895   2877816832
25  Y   57227415    3034054656
26  KI270728.1  1872759 3091464192
27  KI270727.1  448248  3093561344
28  KI270442.1  392061  3094085632
29  KI270729.1  280839  3094609920
30  GL000225.1  211173  3095134208
31  KI270743.1  210658  3095396352
32  GL000008.2  209709  3095658496
33  GL000009.2  201709  3095920640
34  KI270747.1  198735  3096182784
35  KI270722.1  194050  3096444928
36  GL000194.1  191469  3096707072
37  KI270742.1  186739  3096969216
38  GL000205.2  185591  3097231360
39  GL000195.1  182896  3097493504
40  KI270736.1  181920  3097755648
41  KI270733.1  179772  3098017792
42  GL000224.1  179693  3098279936
43  GL000219.1  179198  3098542080
44  KI270719.1  176845  3098804224
45  GL000216.2  176608  3099066368
46  KI270712.1  176043  3099328512
47  KI270706.1  175055  3099590656
48  KI270725.1  172810  3099852800
49  KI270744.1  168472  3100114944
50  KI270734.1  165050  3100377088
51  GL000213.1  164239  3100639232
52  GL000220.1  161802  3100901376
53  KI270715.1  161471  3101163520
54  GL000218.1  161147  3101425664
55  KI270749.1  158759  3101687808
56  KI270741.1  157432  3101949952
57  GL000221.1  155397  3102212096
58  KI270716.1  153799  3102474240
59  KI270731.1  150754  3102736384
60  KI270751.1  150742  3102998528
61  KI270750.1  148850  3103260672
62  KI270519.1  138126  3103522816
63  GL000214.1  137718  3103784960
64  KI270708.1  127682  3104047104
65  KI270730.1  112551  3104309248
66  KI270438.1  112505  3104571392
67  KI270737.1  103838  3104833536
68  KI270721.1  100316  3105095680
69  KI270738.1  99375   3105357824
70  KI270748.1  93321   3105619968
71  KI270435.1  92983   3105882112
72  GL000208.1  92689   3106144256
73  KI270538.1  91309   3106406400
74  KI270756.1  79590   3106668544
75  KI270739.1  73985   3106930688
76  KI270757.1  71251   3107192832
77  KI270709.1  66860   3107454976
78  KI270746.1  66486   3107717120
79  KI270753.1  62944   3107979264
80  KI270589.1  44474   3108241408
81  KI270726.1  43739   3108503552
82  KI270735.1  42811   3108765696
83  KI270711.1  42210   3109027840
84  KI270745.1  41891   3109289984
85  KI270714.1  41717   3109552128
86  KI270732.1  41543   3109814272
87  KI270713.1  40745   3110076416
88  KI270754.1  40191   3110338560
89  KI270710.1  40176   3110600704
90  KI270717.1  40062   3110862848
91  KI270724.1  39555   3111124992
92  KI270720.1  39050   3111387136
93  KI270723.1  38115   3111649280
94  KI270718.1  38054   3111911424
95  KI270317.1  37690   3112173568
96  KI270740.1  37240   3112435712
97  KI270755.1  36723   3112697856
98  KI270707.1  32032   3112960000
99  KI270579.1  31033   3113222144
100 KI270752.1  27745   3113484288
101 KI270512.1  22689   3113746432
102 KI270322.1  21476   3114008576
103 GL000226.1  15008   3114270720
104 KI270311.1  12399   3114532864
105 KI270366.1  8320    3114795008
106 KI270511.1  8127    3115057152
107 KI270448.1  7992    3115319296
108 KI270521.1  7642    3115581440
109 KI270581.1  7046    3115843584
110 KI270582.1  6504    3116105728
111 KI270515.1  6361    3116367872
112 KI270588.1  6158    3116630016
113 KI270591.1  5796    3116892160
114 KI270522.1  5674    3117154304
115 KI270507.1  5353    3117416448
116 KI270590.1  4685    3117678592
117 KI270584.1  4513    3117940736
118 KI270320.1  4416    3118202880
119 KI270382.1  4215    3118465024
120 KI270468.1  4055    3118727168
121 KI270467.1  3920    3118989312
122 KI270362.1  3530    3119251456
123 KI270517.1  3253    3119513600
124 KI270593.1  3041    3119775744
125 KI270528.1  2983    3120037888
126 KI270587.1  2969    3120300032
127 KI270364.1  2855    3120562176
128 KI270371.1  2805    3120824320
129 KI270333.1  2699    3121086464
130 KI270374.1  2656    3121348608
131 KI270411.1  2646    3121610752
132 KI270414.1  2489    3121872896
133 KI270510.1  2415    3122135040
134 KI270390.1  2387    3122397184
135 KI270375.1  2378    3122659328
136 KI270420.1  2321    3122921472
137 KI270509.1  2318    3123183616
138 KI270315.1  2276    3123445760
139 KI270302.1  2274    3123707904
140 KI270518.1  2186    3123970048
141 KI270530.1  2168    3124232192
142 KI270304.1  2165    3124494336
143 KI270418.1  2145    3124756480
144 KI270424.1  2140    3125018624
145 KI270417.1  2043    3125280768
146 KI270508.1  1951    3125542912
147 KI270303.1  1942    3125805056
148 KI270381.1  1930    3126067200
149 KI270529.1  1899    3126329344
150 KI270425.1  1884    3126591488
151 KI270396.1  1880    3126853632
152 KI270363.1  1803    3127115776
153 KI270386.1  1788    3127377920
154 KI270465.1  1774    3127640064
155 KI270383.1  1750    3127902208
156 KI270384.1  1658    3128164352
157 KI270330.1  1652    3128426496
158 KI270372.1  1650    3128688640
159 KI270548.1  1599    3128950784
160 KI270580.1  1553    3129212928
161 KI270387.1  1537    3129475072
162 KI270391.1  1484    3129737216
163 KI270305.1  1472    3129999360
164 KI270373.1  1451    3130261504
165 KI270422.1  1445    3130523648
166 KI270316.1  1444    3130785792
167 KI270340.1  1428    3131047936
168 KI270338.1  1428    3131310080
169 KI270583.1  1400    3131572224
170 KI270334.1  1368    3131834368
171 KI270429.1  1361    3132096512
172 KI270393.1  1308    3132358656
173 KI270516.1  1300    3132620800
174 KI270389.1  1298    3132882944
175 KI270466.1  1233    3133145088
176 KI270388.1  1216    3133407232
177 KI270544.1  1202    3133669376
178 KI270310.1  1201    3133931520
179 KI270412.1  1179    3134193664
180 KI270395.1  1143    3134455808
181 KI270376.1  1136    3134717952
182 KI270337.1  1121    3134980096
183 KI270335.1  1048    3135242240
184 KI270378.1  1048    3135504384
185 KI270379.1  1045    3135766528
186 KI270329.1  1040    3136028672
187 KI270419.1  1029    3136290816
188 KI270336.1  1026    3136552960
189 KI270312.1  998 3136815104
190 KI270539.1  993 3137077248
191 KI270385.1  990 3137339392
192 KI270423.1  981 3137601536
193 KI270392.1  971 3137863680
194 KI270394.1  970 3138125824
--sjdbOverhang = 74 taken from the generated genome
Started loading the genome: Mon Jun  5 15:30:06 2017

checking Genome sizefile size: 3190332497 bytes; state: good=1 eof=0 fail=0 bad=0
checking SA sizefile size: 24728921047 bytes; state: good=1 eof=0 fail=0 bad=0
checking /SAindex sizefile size: 1565873619 bytes; state: good=1 eof=0 fail=0 bad=0
Read from SAindex: genomeSAindexNbases=14  nSAi=357913940
nGenome=3190332497;  nSAbyte=24728921047
GstrandBit=32   SA number of indices=5994889950
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 3190332497 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 3190332497 bytes
SA file size: 24728921047 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 24728921047 bytes
Loading SAindex ... done: 1565873619 bytes
Finished loading the genome: Mon Jun  5 15:30:37 2017

Processing splice junctions database sjdbN=348621,   sjdbOverhang=74 
alignIntronMax=alignMatesGapMax=0, the max intron size will be approximately determined by (2^winBinNbits)*winAnchorDistNbins=589824
Jun 05 15:30:38   Loaded database junctions from the generated genome genome_index_38.83/sjdbList.out.tab: 348621 total junctions

alexdobin commented 7 years ago

Hi Tommy,

sorry for the belated reply. You have 188 million junctions, which would require quite a lot of RAM. Each junction adds ~150 bases to the genome, i.e. ~14 GigaBases of extra sequence in total. I would suggest limiting the number of junctions to a few million. It's probably better to run STAR in 2-pass mode for each sample, and also add a few million "common" junctions.

Cheers Alex

tommycarstensen commented 7 years ago

Thanks for replying, @alexdobin! I did run it in 2-pass mode already. In the end I made it work by using 128GB of memory and sticking to running it single-threaded.

I already used the option --limitSjdbInsertNsj 4000000.

How do I add a few million "common" junctions?

I'm rather new to RNA alignment, so I apologise if my question is stupid.

Thanks, Tommy

alexdobin commented 7 years ago

Hi Tommy,

sorry, I guess I am not entirely clear about your procedure. It looks like you are running the "manual" 2-step procedure, which involves (i) 1st pass mapping of all samples and (ii) 2nd pass mapping with the junctions detected in all samples in the 1st pass.

The total number of junctions from all samples is 188M, but after collapsing it should fall below 4M; otherwise the --limitSjdbInsertNsj 4000000 limit would have been exceeded. Please check towards the end of the Log.out file for the number of collapsed junctions (or send it to me). It will look like: "Finished SA search: number of new junctions=7669, old junctions=0"

How much RAM did the machine have where the run failed?

If you insert so many junctions, a couple of problems may occur:

  1. The mapping speed is reduced significantly.
  2. The number of multimappers increases at the expense of unique mappers.

Did this happen on your 128GB machine run? If it did, you may want to try a slightly modified 2-pass procedure which consists of:

  1. 1st pass mapping of all samples
  2. Merging SJ.out.tab from all samples, collapsing the junctions, and filtering them to bring the number of "common" junctions (SJ.out.tab.common) down to <1M. For instance, you can filter by the number of samples the junction is detected in or by the total number of reads in all samples, filter non-canonical junctions more harshly, etc.
  3. Run 2nd pass for all samples with the SJ.out.tab.common from (2), and each sample SJ.out.tab from (1).
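
Step 2 above could be sketched in Python roughly as follows. This is only an illustration, not the procedure STAR itself uses: it assumes the standard 9-column SJ.out.tab layout (chromosome, start, end, strand, intron motif, annotated flag, unique reads, multi-mapping reads, max overhang), and the function names and the `min_samples` threshold are hypothetical.

```python
def collapse_junctions(sample_tables):
    """Collapse per-sample SJ.out.tab lines into one record per junction.

    sample_tables: iterable with one iterable of text lines per sample.
    Returns {(chrom, start, end, strand): record}.
    """
    collapsed = {}
    for lines in sample_tables:
        seen = set()  # junctions already counted for this sample
        for line in lines:
            f = line.split()
            if len(f) < 9:
                continue  # skip blank or truncated lines
            key = (f[0], int(f[1]), int(f[2]), f[3])
            rec = collapsed.setdefault(key, {
                "motif": f[4], "annotated": f[5],
                "unique": 0, "multi": 0, "overhang": 0, "samples": 0,
            })
            rec["unique"] += int(f[6])   # sum unique reads over samples
            rec["multi"] += int(f[7])    # sum multi-mapping reads over samples
            rec["overhang"] = max(rec["overhang"], int(f[8]))
            if key not in seen:
                seen.add(key)
                rec["samples"] += 1      # number of samples with this junction
    return collapsed


def common_junctions(collapsed, min_samples=2):
    """Keep junctions detected in at least min_samples samples (hypothetical cut-off)."""
    return {k: r for k, r in collapsed.items() if r["samples"] >= min_samples}


def format_row(key, rec):
    """Format one collapsed junction as a tab-separated line in SJ.out.tab column order."""
    chrom, start, end, strand = key
    fields = (chrom, start, end, strand, rec["motif"], rec["annotated"],
              rec["unique"], rec["multi"], rec["overhang"])
    return "\t".join(str(x) for x in fields)
```

In practice each element of `sample_tables` would be an open file handle for one sample's 1st-pass SJ.out.tab, and the rows from `format_row` would be written to SJ.out.tab.common. The cut-off would need tuning until the number of junctions drops below 1M.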

Cheers Alex

tommycarstensen commented 7 years ago

Hi Alex, @alexdobin

Thanks again for replying. It's really beneficial to get your expert advice.

Here is the number of collapsed junctions from Log.out (Log.out.txt):

Jun 06 03:04:14   Finished SA search: number of new junctions=3129482, old junctions=348621

I tried requesting 32GB or 64GB of RAM initially. That failed. 128GB and a bit less than 96GB worked. The machines I used mostly had 256GB of RAM.

Is the merging of SJ.out.tab from all samples, collapsing the junctions and filtering them described in the manual? I'm happy to try an alternative approach that deviates from the one in the manual, if you think that would yield higher quality results.

Thanks a million, Tommy

alexdobin commented 7 years ago

Hi Tommy,

if you are satisfied with the 2-pass run with all junctions, I think there is no reason to go for the "filtered" version. In terms of quality of the results, one of the questions is whether you got a significant reduction of unique mappers. Have you compared the Log.final.out results for the 1st pass with the 2nd pass? You can post these files for one of the samples, and we will discuss it further.

Cheers Alex

tommycarstensen commented 7 years ago

Hi Alex,

We appreciate your help immensely. Sorry for the slow reply. I had to run a few jobs again, because I discovered a bug (typing error).

I noticed I had slightly fewer mapped reads after the second pass compared to the first pass:

out_STAR/38.83/pass1/NA18498/Log.out:BAM sorting: 265254 mapped reads
out_STAR/38.83/pass2/NA18498/Log.out:BAM sorting: 264598 mapped reads

Here are both of the Log.out files for sample NA18498 from pass 1 and pass 2.

Would you do anything differently yourself or should we stick with these best practices?

Thanks, Tommy

alexdobin commented 7 years ago

Hi Tommy,

the change in the number of mapped reads is very small; however, you need to look at the more detailed statistics in the Log.final.out file. Please post them for pass1 and pass2. In particular, you would not want many reads to become multi-mappers in the 2nd pass.
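
As a rough illustration of that comparison, the Python sketch below parses two Log.final.out files and reports the shift in a percentage field. It assumes each statistic sits on a `name | value` line, as in STAR's Log.final.out; the function names are hypothetical.

```python
def parse_log_final(text):
    """Parse the 'name | value' lines of a Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # blank lines and section headers such as 'UNIQUE READS:'
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats


def percent_change(pass1, pass2, name):
    """Return pass2 minus pass1, in percentage points, for a field such as '12.34%'."""
    before = float(pass1[name].rstrip("%"))
    after = float(pass2[name].rstrip("%"))
    return round(after - before, 2)
```

For example, `percent_change(p1, p2, "% of reads mapped to multiple loci")` gives the change in the multi-mapper percentage between the two passes, where `p1` and `p2` are the dicts parsed from the two files.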

Cheers Alex

tommycarstensen commented 7 years ago

Hi Alex,

I'm afraid I see exactly that, i.e. an increase in multi-mapped reads. Is there any way I can prevent it? Here is Log.final.out for pass 1 and pass 2. Thanks for your continued efforts with this.

paste out_STAR/38.83/pass1/NA18498/Log.final.out <(cut -d"|" -f2 out_STAR/38.83/pass2/NA18498/Log.final.out)
                                 Started job on |   Jun 26 00:02:16     Jun 28 17:24:45
                             Started mapping on |   Jun 26 00:03:05     Jun 28 19:43:44
                                    Finished on |   Jun 26 02:03:25     Jun 29 16:47:27
       Mapping speed, Million of reads per hour |   20.06       1.91

                          Number of input reads |   40235616        40235616
                      Average input read length |   150     150
                                    UNIQUE READS:                                       UNIQUE READS:
                   Uniquely mapped reads number |   36720531        34800709
                        Uniquely mapped reads % |   91.26%      86.49%
                          Average mapped length |   149.47      149.49
                       Number of splices: Total |   21420981        22260002
            Number of splices: Annotated (sjdb) |   21195691        22241928
                       Number of splices: GT/AG |   21223135        21247461
                       Number of splices: GC/AG |   143416      653082
                       Number of splices: AT/AC |   18607       69522
               Number of splices: Non-canonical |   35823       289937
                      Mismatch rate per base, % |   0.25%       0.24%
                         Deletion rate per base |   0.01%       0.01%
                        Deletion average length |   1.58        1.56
                        Insertion rate per base |   0.01%       0.01%
                       Insertion average length |   1.28        1.30
                             MULTI-MAPPING READS:                                MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   2400558     4003080
             % of reads mapped to multiple loci |   5.97%       9.95%
        Number of reads mapped to too many loci |   21630       99099
             % of reads mapped to too many loci |   0.05%       0.25%
                                  UNMAPPED READS:                                     UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%       0.00%
                 % of reads unmapped: too short |   2.71%       2.96%
                     % of reads unmapped: other |   0.01%       0.36%
                                  CHIMERIC READS:                                     CHIMERIC READS:
                       Number of chimeric reads |   0       0
                            % of chimeric reads |   0.00%       0.00%

Best wishes, Tommy

alexdobin commented 7 years ago

Hi Tommy,

the increase in multimapping is typical for 2-pass mapping with a large number of junctions, so the solution would be to heavily filter the junctions used in the 2nd pass. Here is an approximate strategy:

  1. Collapse the junctions from all samples into a set of unique junctions counting the number of reads per junction from all samples, and the number of samples the junction was detected in. I wrote a simple script that does just that: https://github.com/alexdobin/STAR/blob/master/extras/scripts/sjCollapseSamples.awk

  2. Calculate some statistics on these junctions: number of junctions with different intron motifs (column 5), number of junctions detected in 1,2,3... samples (column 10) etc. This will give you an idea on how to filter these junctions best.

  3. Filter the junctions on: (i) number of samples detected, (ii) total number of unique/multimap reads, (iii) max overhang. You may want to do harsher filtering for non-canonical junctions (col5=0). You would want to bring the number of junctions to <1M.

  4. For the 2nd pass, use --sjdbFileChrStartEnd SJ.filtered /path/to/this/sample/1st/pass/SJ.out.tab where SJ.filtered is the list of filtered junctions from 3, and /path/to/this/sample/1st/pass/SJ.out.tab is the SJ.out.tab of the 1st pass for this one sample.

You may need to adjust the filtering in step 3 to bring the increase in multimappers down to no more than 1-2%.
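
Steps 2 and 3 could look roughly like the Python sketch below. It takes the collapsed junctions as rows of tab-separated fields, with the intron motif in column 5 and the number of samples in column 10, as described above. That columns 7-9 keep their SJ.out.tab meaning (unique reads, multi-mapping reads, max overhang) is an assumption, and all thresholds and function names are hypothetical starting points rather than recommended values.

```python
from collections import Counter


def junction_stats(rows):
    """Count junctions per intron motif (column 5) and per number of samples (column 10)."""
    by_motif = Counter(r[4] for r in rows)
    by_samples = Counter(int(r[9]) for r in rows)
    return by_motif, by_samples


def keep_junction(row, min_samples=2, min_reads=3, min_overhang=12,
                  noncanonical_factor=3):
    """Decide whether one collapsed junction passes the filters.

    Non-canonical junctions (column 5 == "0") must clear thresholds
    that are noncanonical_factor times stricter.
    """
    motif = row[4]
    reads = int(row[6]) + int(row[7])  # unique + multi-mapping reads (assumed columns)
    overhang = int(row[8])             # max overhang (assumed column)
    samples = int(row[9])
    factor = noncanonical_factor if motif == "0" else 1
    return (samples >= min_samples * factor
            and reads >= min_reads * factor
            and overhang >= min_overhang)


def filter_junctions(rows, **thresholds):
    """Return the rows that pass keep_junction with the given thresholds."""
    return [r for r in rows if keep_junction(r, **thresholds)]
```

The counts from `junction_stats` give a first idea of where to set the cut-offs. The filtering would then be repeated with stricter values until fewer than 1M junctions remain and the rise in multimappers is acceptable.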

Cheers Alex

favilaco commented 6 years ago

Hi,

I've been trying to generate the index with both STAR/2.5.2 and STAR/2.5.3a with no success (strange, since I was able to do so about 2 months ago).

I've been allocating 40, 80 and 200 GB of RAM on different clusters (to make sure it was not memory related), and the error is always the same (and appears only a few seconds after the job starts):

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/var/spool/pbs/mom_priv/jobs/104619.master17: line 15: 18542 Aborted 

The command I used is:

STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --sjdbOverhang 100 --runThreadN 8 \
--genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa \
--sjdbGTFfile /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.annotation.gtf

What can be happening?

Thanks a lot! Francisco

tommycarstensen commented 6 years ago

@favilaco What worked for me was running it single-threaded, using --limitSjdbInsertNsj 4000000 and allocating 128 GB of memory. Specifically, I would replace --runThreadN 8 with --runThreadN 1 if I were you. That could possibly solve your problem.

favilaco commented 6 years ago

@tommycarstensen Thanks a lot for the suggestion!

However, I tried it out and it failed again:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/var/spool/pbs/mom_priv/jobs/...SC: line 15: 22487 Aborted

This is the command I tried (allocating 128Gb of RAM):

STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --sjdbOverhang 100 --runThreadN 1 --limitSjdbInsertNsj 4000000 --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --sjdbGTFfile /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.annotation.gtf

Does someone else have any other suggestion(s)?

alexdobin commented 6 years ago

Hi @favilaco

you are using the "transcript" FASTA for the genome generation. If your intent is to map to the "transcriptome", you do not need the GTF file. Also, you would need to reduce --genomeChrBinNbits to min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]) since the transcriptome contains a lot of references.
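The formula above can be written out as a small helper. This is a sketch only: the function name is hypothetical, rounding to the nearest integer is my assumption (the formula as quoted does not say how to round), and the example inputs are illustrative rather than measured from any particular reference.

```python
import math


def genome_chr_bin_nbits(genome_length, n_references, read_length=100, cap=18):
    """min(cap, log2(max(GenomeLength / NumberOfReferences, ReadLength))).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    per_reference = max(genome_length / n_references, read_length)
    return min(cap, round(math.log2(per_reference)))


if __name__ == "__main__":
    # A few long chromosomes: the cap of 18 applies.
    print(genome_chr_bin_nbits(3_100_000_000, 25))
    # Many short references (transcriptome-like): the value drops well below 18.
    print(genome_chr_bin_nbits(400_000_000, 200_000))
    # References shorter than the reads: ReadLength sets the floor.
    print(genome_chr_bin_nbits(1_000_000, 50_000, read_length=100))
```

The result is the value to pass to --genomeChrBinNbits when generating the index.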

Cheers Alex

favilaco commented 6 years ago

Hi @alexdobin

Thanks for your previous answer.

I have tried using your suggestions:

STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --genomeChrBinNbits 18

But again, it failed...

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Do you have any other ideas about what the issue might be?

Thanks a lot, Francisco

tommycarstensen commented 6 years ago

@alexdobin Sorry for taking so long to get back to you. I had to help with other projects.

I did a scatter matrix of the various parameters for all SJs and for non-canonical SJs to begin with. Here it is for non-canonical SJs: [figure: scattersj_noncanonical]

1) I used the default --outSJfilterOverhangMin of 30 12 12 12 (one value for non-canonical and three for canonical motifs), but I see this distribution for non-canonical and canonical reads. Can you explain that? Non-canonical: [figure: histsj_maxo_noncanonical] Canonical: [figure: histsj_maxo_canonical]

2) Can you also explain why the unannotated SJs show a drop at large max overhang, whereas this is not observed for the annotated SJs? Compare with the plots above for (non)canonical SJs. [figure: histsj_maxo_annotated]

3) I also noticed that the intron motif lengths of the non-canonical SJs have a peak at ~40 kbp. Can you explain this peak? Would you recommend filtering based on length? The peak at 100 kbp is just due to me applying a threshold. [figure: histsj_length_noncanonical] I have some intron motifs of similar length for the canonical SJs: [figure: histsj_length_canonical]

4) And finally, do you know why a lot of annotated SJs are covered by multi-mapped reads? The values are capped at 2000. [figure: histsj_nm_annotated]

5) Just one more question. This one is less important. Would you recommend applying separate thresholds for canonical and non-canonical reads, and would you apply them only to absolute values or also to fractions between unique and multi mappers? [figure: histsj_nu_noncanonical] [figure: histsj_nu_div_nusumnm_noncanonical]

6) P.S. I noticed the definition of a canonical splice junction is missing from the manual and the paper. Is the exact definition written down somewhere?

tommycarstensen commented 6 years ago

@favilaco Does your --genomeDir exist? Have you tried with the --limitGenomeGenerateRAM flag? I think the latter will solve your problem.
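A quick pre-flight check of the output directory can rule out the first question before the job is submitted. A minimal sketch with a hypothetical helper name; it only tests that the path is an existing, writable directory and says nothing about free disk space or memory.

```python
import os


def check_genome_dir(path):
    """Return a list of problems found with the intended --genomeDir path."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} is not an existing directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


if __name__ == "__main__":
    # Example path taken from the commands in this thread.
    for problem in check_genome_dir("/user/data/Francisco/GENCODE27_GRCh38.p10"):
        print("problem:", problem)
```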

alexdobin commented 6 years ago

Hi Tommy,

interesting observations, I will look into it carefully on Monday. Would you mind starting a new thread, or is it too much work to copy the graphs and text? Also, ideally, such interesting topics should go to the google-group https://groups.google.com/forum/#!forum/rna-star while GitHub "issues" are more about technical issues.

Cheers Alex

alexdobin commented 6 years ago

Hi @favilaco

I think you need to reduce --genomeChrBinNbits. According to the formula, log2[GenomeLength/NumberOfReferences] = log2[MeanTranscriptLength] ~ log2(2000) ~ 11, so I would try --genomeChrBinNbits 11.
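The two inputs to that estimate, GenomeLength and NumberOfReferences, can be measured directly from the FASTA file instead of guessed. A sketch under the assumption of a plain, uncompressed FASTA; the function names are hypothetical and the tiny demo records are made up so that the mean length comes out at 2000.

```python
import math


def fasta_stats(lines):
    """Return (total sequence length, number of records) for FASTA lines."""
    total, n_refs = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_refs += 1          # one header line per reference
        else:
            total += len(line)   # sequence lines may be wrapped
    return total, n_refs


def mean_length_log2(lines):
    """Return (mean reference length, log2 of that mean)."""
    total, n_refs = fasta_stats(lines)
    if n_refs == 0:
        raise ValueError("no FASTA records found")
    mean_len = total / n_refs
    return mean_len, math.log2(mean_len)


# Made-up records of 1500 and 2500 bases, for illustration only.
DEMO = [">tx1", "A" * 1000, "C" * 500, ">tx2", "G" * 2500]

if __name__ == "__main__":
    mean_len, bits = mean_length_log2(DEMO)
    print(f"mean length {mean_len:.0f}, log2 = {bits:.2f}")
    # For a real file: with open("gencode.v27.transcripts.fa") as fh: ...
```

For a real transcriptome, passing the open file handle to `mean_length_log2` gives the number to round and compare with the suggested value of 11.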

Cheers Alex

favilaco commented 6 years ago

Hi @tommycarstensen and @alexdobin

The folder itself exists:

-bash-4.2$ readlink -f gencode.v27.transcripts.fa 
/user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa

I also incorporated the two parameters you mentioned and finally it worked!

This is the final command:

STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10/ --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --genomeChrBinNbits 11 --limitGenomeGenerateRAM 124544990592
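The --limitGenomeGenerateRAM value is given in bytes; the number in the command above is roughly 116 GiB. A small sketch for deriving such a value from a job's memory allocation follows. The helper names and the idea of leaving a fixed headroom below the allocation are my assumptions, not something stated in this thread.

```python
GIB = 1024 ** 3


def ram_limit_bytes(allocated_gib, headroom_gib=8.0):
    """Bytes to pass to --limitGenomeGenerateRAM, leaving some headroom.

    The headroom below the scheduler allocation is a hypothetical safety margin.
    """
    usable = allocated_gib - headroom_gib
    if usable <= 0:
        raise ValueError("headroom must be smaller than the allocation")
    return int(usable * GIB)


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB for easier reading."""
    return n_bytes / GIB


if __name__ == "__main__":
    print(ram_limit_bytes(128, headroom_gib=12))
    print(f"{bytes_to_gib(124544990592):.2f} GiB")
```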

Thanks a lot for your help! ;)

Best, Francisco

tommycarstensen commented 6 years ago

@alexdobin I am happy to create a new separate thread on Google Groups. I will do so before Monday. I agree that this is no longer a technical issue and so is not appropriate for GitHub.