Hi Tommy,
sorry for the belated reply. You have 188 million junctions, which would require quite a lot of RAM. Each junction adds ~150 bases to the genome, i.e. ~14 gigabases of extra sequence in total. I would suggest limiting the number of junctions to a few million. It's probably better to run STAR in 2-pass mode for each sample, and also add a few million "common" junctions.
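For illustration only, a per-sample 2-pass run that also loads a shared junction set might look roughly like this (the index path, read files, and common_junctions.SJ.tab are placeholders, not from your setup):
STAR --runThreadN 8 --genomeDir /path/to/genome_index \
--readFilesIn sampleX_R1.fastq.gz sampleX_R2.fastq.gz --readFilesCommand zcat \
--twopassMode Basic \
--sjdbFileChrStartEnd common_junctions.SJ.tab \
--outSAMtype BAM SortedByCoordinate --outFileNamePrefix out/sampleX/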
Cheers Alex
Thanks for replying @alexdobin ! I did run it in 2-pass mode already. In the end I made it work by using 128GB of memory and sticking to running it single-threaded.
I already used the option --limitSjdbInsertNsj 4000000. How do I add a few million "common" junctions?
I'm rather new to RNA alignment, so I apologise if my question is stupid.
Thanks, Tommy
Hi Tommy,
sorry, I guess I am not entirely clear about your procedure. It looks like you are running the "manual" 2-step procedure, which involves (i) 1st-pass mapping of all samples and (ii) 2nd-pass mapping with the junctions detected in all samples in the 1st pass.
The total number of junctions from all samples is 188M, but after collapsing it should fall below 4M; otherwise the --limitSjdbInsertNsj 4000000 limit would be exceeded. Please check towards the end of the Log.out file for the number of collapsed junctions (or send it to me); it will look like: "Finished SA search: number of new junctions=7669, old junctions=0"
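As a rough sketch of that 2nd-pass step and of checking the collapsed count (the paths, read files, and directory layout are placeholders):
# 2nd pass for one sample, inserting the 1st-pass junctions from all samples
STAR --runThreadN 8 --genomeDir /path/to/genome_index \
--readFilesIn sampleX_R1.fastq.gz sampleX_R2.fastq.gz --readFilesCommand zcat \
--sjdbFileChrStartEnd pass1/*/SJ.out.tab \
--limitSjdbInsertNsj 4000000 \
--outSAMtype BAM SortedByCoordinate --outFileNamePrefix pass2/sampleX/
# number of junctions actually inserted after collapsing
grep "Finished SA search" pass2/sampleX/Log.out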
How much RAM did the machine have where the run failed?
If you insert so many junctions, a couple of problems may occur:
Did this happen on your 128GB machine run? If it did, you may want to try a slightly modified 2-pass procedure which consists of:
Cheers Alex
Hi Alex, @alexdobin
Thanks again for replying. It's really beneficial to get your expert advice.
Here is the number of collapsed junctions from Log.out.
Log.out.txt:
Jun 06 03:04:14 Finished SA search: number of new junctions=3129482, old junctions=348621
I tried requesting 32GB or 64GB of RAM initially. That failed. 128GB and a bit less than 96GB worked. The machines I used mostly had 256GB of RAM.
Is the procedure of merging the SJ.out.tab files from all samples, collapsing the junctions, and filtering them described in the manual? I'm happy to try an alternative approach that deviates from the one in the manual, if you think it would yield higher-quality results.
Thanks a million, Tommy
Hi Tommy,
if you are satisfied with the 2-pass run with all junctions, I think there is no reason to go for the "filtered" version. In terms of quality of the results, one of the questions is whether you got a significant reduction of unique mappers. Have you compared the Log.final.out results for the 1st pass with the 2nd pass? You can post these files for one of the samples, and we will discuss it further.
Cheers Alex
Hi Alex,
We appreciate your help immensely. Sorry for the slow reply. I had to run a few jobs again, because I discovered a bug (typing error).
I noticed I had slightly fewer mapped reads after the second pass compared to the first pass:
out_STAR/38.83/pass1/NA18498/Log.out:BAM sorting: 265254 mapped reads
out_STAR/38.83/pass2/NA18498/Log.out:BAM sorting: 264598 mapped reads
Here are both of the Log.out files for sample NA18498 from pass 1 and pass 2.
Would you do anything differently yourself or should we stick with these best practices?
Thanks, Tommy
Hi Tommy,
the change in the number of mapped reads is very small; however, you need to look at the more detailed statistics in the Log.final.out file - please post them for pass 1 and pass 2. In particular, you would not want many reads to become multi-mappers in the 2nd pass.
Cheers Alex
Hi Alex,
I'm afraid I see exactly that, i.e. an increase in multi-mapped reads. Is there any way I can prevent it? Here are the Log.final.out files for pass 1 and pass 2. Thanks for your continued efforts with this.
paste out_STAR/38.83/pass1/NA18498/Log.final.out <(cut -d"|" -f2 out_STAR/38.83/pass2/NA18498/Log.final.out)
Started job on | Jun 26 00:02:16 Jun 28 17:24:45
Started mapping on | Jun 26 00:03:05 Jun 28 19:43:44
Finished on | Jun 26 02:03:25 Jun 29 16:47:27
Mapping speed, Million of reads per hour | 20.06 1.91
Number of input reads | 40235616 40235616
Average input read length | 150 150
UNIQUE READS: UNIQUE READS:
Uniquely mapped reads number | 36720531 34800709
Uniquely mapped reads % | 91.26% 86.49%
Average mapped length | 149.47 149.49
Number of splices: Total | 21420981 22260002
Number of splices: Annotated (sjdb) | 21195691 22241928
Number of splices: GT/AG | 21223135 21247461
Number of splices: GC/AG | 143416 653082
Number of splices: AT/AC | 18607 69522
Number of splices: Non-canonical | 35823 289937
Mismatch rate per base, % | 0.25% 0.24%
Deletion rate per base | 0.01% 0.01%
Deletion average length | 1.58 1.56
Insertion rate per base | 0.01% 0.01%
Insertion average length | 1.28 1.30
MULTI-MAPPING READS: MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2400558 4003080
% of reads mapped to multiple loci | 5.97% 9.95%
Number of reads mapped to too many loci | 21630 99099
% of reads mapped to too many loci | 0.05% 0.25%
UNMAPPED READS: UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00% 0.00%
% of reads unmapped: too short | 2.71% 2.96%
% of reads unmapped: other | 0.01% 0.36%
CHIMERIC READS: CHIMERIC READS:
Number of chimeric reads | 0 0
% of chimeric reads | 0.00% 0.00%
Best wishes, Tommy
Hi Tommy,
the increase in multi-mapping is typical for 2-pass mapping with a large number of junctions, so the solution would be to heavily filter the junctions used in the 2nd pass. Here is an approximate strategy:
1. Collapse the junctions from all samples into a set of unique junctions, counting the number of reads per junction from all samples and the number of samples the junction was detected in. I wrote a simple script that does just that: https://github.com/alexdobin/STAR/blob/master/extras/scripts/sjCollapseSamples.awk
2. Calculate some statistics on these junctions: number of junctions with different intron motifs (column 5), number of junctions detected in 1, 2, 3... samples (column 10), etc. This will give you an idea of how best to filter these junctions.
3. Filter the junctions on: (i) number of samples detected in, (ii) total number of unique/multi-mapping reads, (iii) max overhang. You may want to do harsher filtering for non-canonical junctions (col5=0). You would want to bring the number of junctions to <1M.
4. For the 2nd pass, use --sjdbFileChrStartEnd SJ.filtered /path/to/this/sample/1st/pass/SJ.out.tab, where SJ.filtered is the list of filtered junctions from step 3, and /path/to/this/sample/1st/pass/SJ.out.tab is the SJ.out.tab of the 1st pass for this one sample.
You may need to adjust the filtering in step 3 to bring the increase in multi-mappers to no more than 1-2%; a sketch of these steps follows below.
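A minimal sketch of steps 1-4, assuming the collapsed output keeps the SJ.out.tab column order (col5 = intron motif, col7 = unique reads, col8 = multi-mapping reads, col9 = max overhang) with the number of samples in column 10, and assuming sjCollapseSamples.awk takes the SJ.out.tab files as arguments; the thresholds are only examples to tune:
# step 1: collapse the per-sample junction lists
awk -f sjCollapseSamples.awk pass1/*/SJ.out.tab > SJ.collapsed.tab
# step 2: quick statistics - junctions per intron motif, and per number of samples detected in
cut -f5 SJ.collapsed.tab | sort | uniq -c
cut -f10 SJ.collapsed.tab | sort -n | uniq -c
# step 3: example filter, harsher for non-canonical junctions (col5 == 0)
awk '($5 > 0 && $10 >= 2 && $7 >= 3 && $9 >= 12) || ($5 == 0 && $10 >= 5 && $7 >= 10 && $9 >= 30)' SJ.collapsed.tab > SJ.filtered.tab
wc -l SJ.filtered.tab   # aim for well under 1M junctions
# step 4: 2nd pass for one sample, with the filtered set plus that sample's own 1st-pass junctions
STAR [usual mapping options] --sjdbFileChrStartEnd SJ.filtered.tab pass1/sampleX/SJ.out.tab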
Cheers Alex
Hi,
I've been trying to generate the index with both STAR/2.5.2 and STAR/2.5.3a with no success (strange, since I was able to do so about 2 months ago).
I've been allocating 40, 80 and 200GB of RAM on different clusters (to make sure it was not memory related), and the error is always the same (and appears only a few seconds after the job starts):
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/var/spool/pbs/mom_priv/jobs/104619.master17: line 15: 18542 Aborted
The command I used is:
STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --sjdbOverhang 100 --runThreadN 8 \
--genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa \
--sjdbGTFfile /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.annotation.gtf
What could be going wrong?
Thanks a lot! Francisco
@favilaco What worked for me was running it single-threaded, using --limitSjdbInsertNsj 4000000 and allocating 128GB of memory. Specifically, I would replace --runThreadN 8 with --runThreadN 1 if I were you. That could possibly solve your problem.
@tommycarstensen Thanks a lot for the suggestion!
However, I tried it out and it failed again...:
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc /var/spool/pbs/mom_priv/jobs/...SC: line 15: 22487 Aborted
This is the command I tried (allocating 128GB of RAM):
STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --sjdbOverhang 100 --runThreadN 1 --limitSjdbInsertNsj 4000000 --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --sjdbGTFfile /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.annotation.gtf
Does someone else have any other suggestion(s)?
Hi @favilaco
you are using the "transcript" FASTA for the genome generation. If your intent is to map to the "transcriptome", you do not need the GTF file. Also, you would need to reduce --genomeChrBinNbits to min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]) since the transcriptome contains a lot of references.
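A rough way to estimate that value from the transcript FASTA, assuming samtools is available and using 100 as an example read length (both are assumptions, not part of STAR):
samtools faidx gencode.v27.transcripts.fa   # writes gencode.v27.transcripts.fa.fai with one length per reference
# min(18, log2[max(mean reference length, read length)])
awk -v readLen=100 '{sum += $2; n++} END {x = sum / n; if (x < readLen) x = readLen; b = int(log(x) / log(2)); if (b > 18) b = 18; print "suggested --genomeChrBinNbits:", b}' gencode.v27.transcripts.fa.fai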
Cheers Alex
Hi @alexdobin
Thanks for your previous answer.
I have tried using your suggestions:
STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10 --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --genomeChrBinNbits 18
But again, it failed...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Do you have any other ideas about what the issue might be?
Thanks a lot, Francisco
@alexdobin Sorry for taking so long to get back to you. I had to help with other projects.
I did a scatter matrix of the various parameters for all SJs and, to begin with, for the non-canonical SJs. Here it is for the non-canonical SJs: [scatter matrix attached]
1) I used the default --outSJfilterOverhangMin of 30 12 12 12 (non-canonical, then the three canonical motif classes), but I see this distribution for non-canonical and canonical reads. Can you explain that?
Non-canonical: [plot attached]
Canonical: [plot attached]
2) Can you also explain why the un-annotated SJs show a drop at large max overhangs, whereas this is not observed for the annotated SJs? Compare with the plots above for (non-)canonical SJs.
3) I also noticed that the intron lengths of the non-canonical SJs have a peak at ~40kbp. Can you explain this peak? Would you recommend filtering based on length? The peak at 100kbp is just due to a threshold I applied. I also see some introns of similar length for the canonical SJs:
4) And finally, do you know why a lot of annotated SJs are covered by multi-mapped reads? The values are capped at 2000.
5) Just one more question; this one is less important. Would you recommend applying separate thresholds for canonical and non-canonical junctions, and would you apply them only to absolute read counts or also to the ratio between unique and multi-mapped reads?
6) P.S. I noticed the definition of a canonical splice junction is missing from the manual and the paper. Is the exact definition written down somewhere?
@favilaco Does your --genomeDir exist? Have you tried the --limitGenomeGenerateRAM flag? I think the latter will solve your problem.
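The value is given in bytes, so with ~128GB allocated it would be something along these lines (the number is just an example matching what worked for me):
STAR --runMode genomeGenerate [your other options] --limitGenomeGenerateRAM 128000000000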
Hi Tommy,
interesting observations, I will look into it carefully on Monday. Would you mind starting a new thread, or is it too much work to copy the graphs and text? Also, ideally, such interesting topics should go to the Google group https://groups.google.com/forum/#!forum/rna-star while GitHub "issues" are more about technical issues.
Cheers Alex
Hi @favilaco
I think you need to reduce --genomeChrBinNbits. According to the formula, log2[GenomeLength/NumberOfReferences] = log2[MeanTranscriptLength] ~ log2(2000) ~ 11, so I would try --genomeChrBinNbits 11.
Cheers Alex
Hi @tommycarstensen and @alexdobin
The folder itself exists:
-bash-4.2$ readlink -f gencode.v27.transcripts.fa
/user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa
I also incorporated the two parameters you mentioned and finally it worked!
This is the final command:
STAR --runMode genomeGenerate --genomeDir /user/data/Francisco/GENCODE27_GRCh38.p10/ --genomeFastaFiles /user/data/Francisco/GENCODE27_GRCh38.p10/gencode.v27.transcripts.fa --genomeChrBinNbits 11 --limitGenomeGenerateRAM 124544990592
Thanks a lot for your help! ;)
Best, Francisco
@alexdobin I am happy to create a new, separate thread on Google Groups. I will do so before Monday. I agree that this is no longer a technical issue, so it is better suited there than on GitHub.
Some of my STAR jobs fail with a std::bad_alloc. Has anyone experienced this before? How did they solve it? Here is the error message I get:
This is the tail of Log.out:
Here is my command line:
Here are further details from another log file: