alexdobin / STAR

RNA-seq aligner
MIT License

STAR stalls with large reference (100k references) #656

Closed dejonggr closed 5 years ago

dejonggr commented 5 years ago

I'm using a model organism pan-genome, which results in ~100,000 smaller contigs, and STAR seems to stall shortly after starting (the output file only reaches ~7.5 Mb):

The Log.progress.out:

       Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other

Here the tail of Log.out:

--sjdbOverhang = 99 taken from the generated genome
Started loading the genome: Tue Jun 4 13:10:21 2019

Genome: size given as a parameter = 1131448484
SA: size given as a parameter = 8326704857
SAindex: size given as a parameter = 1
Read from SAindex: pGe.gSAindexNbases=14 nSAi=357913940
nGenome=1131448484; nSAbyte=8326704857
GstrandBit=32 SA number of indices=2018595116
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 1131448484 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 1131448484 bytes
SA file size: 8326704857 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 8326704857 bytes
Loading SAindex ... done: 1565873619 bytes
Finished loading the genome: Tue Jun 4 13:10:50 2019

Processing splice junctions database sjdbN=436573, pGe.sjdbOverhang=99
alignIntronMax=alignMatesGapMax=0, the max intron size will be approximately determined by (2^winBinNbits)*winAnchorDistNbins=589824
winBinNbits=16 > pGe.gChrBinNbits=0 redefining: winBinNbits=0
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread0 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread0 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread1 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread1 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread2 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread2 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread3 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread3 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread4 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread4 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread5 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread5 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread6 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread6 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate1.thread7 ... ok
Opening the file: BN_pan/BNI/bam/NS.1125.001.NEBNext_dual_i7_C11---NEBNext_dual_i5_C11.BNI1_STARtmp//Unmapped.out.mate2.thread7 ... ok
Created thread # 1
Created thread # 2
Created thread # 3
Created thread # 4
Created thread # 5
Created thread # 6
Created thread # 7

Is this a memory issue due to the large number of contigs? If so, is there a more nuanced solution apart from concatenating the contigs into a pseudo-chromosome?
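[Editor's note: for a genome this fragmented, the STAR manual suggests scaling --genomeChrBinNbits down to min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A rough sketch, using the genome size and contig count visible in the log above and an assumed read length of 100:

```shell
# All three values are assumptions for illustration: genome length and contig
# count are read off the Log.out excerpt above, read length is a guess.
GENOME_LEN=1131448484
N_REFS=100000
READ_LEN=100

# min(18, floor(log2(max(GenomeLen/Nrefs, ReadLen)))), per the STAR manual's
# guidance for genomes with many references
awk -v g="$GENOME_LEN" -v n="$N_REFS" -v r="$READ_LEN" 'BEGIN {
    x = g / n; if (r > x) x = r;   # max(GenomeLen/Nrefs, ReadLen)
    b = int(log(x) / log(2));      # floor(log2(x))
    if (b > 18) b = 18;            # never exceed the default of 18
    print "--genomeChrBinNbits " b
}'
# → --genomeChrBinNbits 13
```

With these numbers the suggested value comes out to 13, well below the default of 18, which reduces the per-contig bin overhead.]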

Cheers, Grant

alexdobin commented 5 years ago

Hi Grant,

100k contigs should not be that slow, though it's close to the boundary. What parameters did you use to generate the genome? Could you send me the Log.out file from the genome generation?

Also, please try to map a very small number of reads, say 10k, with --readMapNumber 10000 and send me the Log.final.out file. The problem may be with mappability.
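[Editor's note: a minimal test-run command along the lines Alex suggests might look like the sketch below; --readMapNumber is the flag he names, but the index directory, fastq names, and remaining options are placeholders, not taken from the thread.

```shell
# Sketch only: pan_genome_index, reads_*.fastq.gz, and the output prefix
# are hypothetical placeholders.
STAR \
    --runThreadN 8 \
    --genomeDir pan_genome_index \
    --readFilesIn reads_1.fastq.gz reads_2.fastq.gz \
    --readFilesCommand zcat \
    --readMapNumber 10000 \
    --outSAMtype BAM Unsorted \
    --outFileNamePrefix test_run_
# Afterwards, test_run_Log.final.out holds the mapping-rate summary to share.
```
]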

Cheers, Alex

dejonggr commented 5 years ago

I was actually running STAR on a reduced file with only 25,000 reads per fastq pair.

I would send the full Log.out file, but it's 165M.

I received a number of warnings re: gene_id, but I removed most of them to keep the file size small. I'm not sure why this is happening, given that I included the following options:

--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --sjdbGTFtagExonParentGene Parent

Log.out.txt

alexdobin commented 5 years ago

Hi Grant,

the command line for genome generation has an = sign, which is not allowed: --genomeChrBinNbits = 16. This actually sets the parameter to 0, which might have caused problems in the mapping step.
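[Editor's note: to make the fix concrete, STAR's option parser takes whitespace-separated flag/value pairs, so the stray = is consumed in place of the intended value. The flag below is the one from the thread; the rest of the command line is a placeholder.

```shell
# Wrong: the "=" token is taken as the option's value; per Alex's note above,
# the parameter effectively ends up as 0.
STAR --runMode genomeGenerate --genomeChrBinNbits = 16 ...

# Right: flag and value separated by whitespace only.
STAR --runMode genomeGenerate --genomeChrBinNbits 16 ...
```
]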

Also, I would recommend converting the GFF3 to GTF before genome generation.
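[Editor's note: the thread doesn't name a conversion tool; one common choice (an assumption on my part) is gffread, whose -T flag selects GTF output.

```shell
# Convert GFF3 annotation to GTF with gffread before genome generation.
# File names are hypothetical placeholders; -T writes GTF instead of GFF3.
gffread annotation.gff3 -T -o annotation.gtf
```
]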

Cheers Alex

dejonggr commented 5 years ago

--genomeChrBinNbits = 16

This was exactly the problem. Not sure how I missed that! Everything seems to be running fine now. I'll close the issue when the job is complete. Thanks for your help!

alexdobin commented 5 years ago

Great, thanks for letting me know you resolved it! Cheers Alex