egaffo / circompara2

Improved bioinformatic pipeline to identify and quantify circRNA expression from RNA-seq data by combining multiple circRNA detection methods

Running out of RAM when using STAR to generate genome in circompara2 pipeline #16

Closed lovebaboon1989 closed 11 months ago

lovebaboon1989 commented 1 year ago

Hi there, I am trying to use circompara2 to detect circRNAs in a human RNA-seq dataset, but I ran into the following problem:

The step that terminated:
cd dbs/indexes/indexes/star/ref-transcripts && STAR --runMode genomeGenerate --runThreadN 1 --genomeFastaFiles /annotation/ref-transcripts.fa --genomeDir . && cd /home
Feb 15 01:28:06 ..... started STAR run
Feb 15 01:28:06 ... starting to generate Genome files
scons: building terminated because of errors.

The error message:
EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome
SOLUTION: please specify --limitGenomeGenerateRAM not less than 144424593450 and make that much RAM available

Feb 15 01:29:04 ...... FATAL ERROR, exiting
scons: *** [dbs/indexes/indexes/star/ref-transcripts/chrLength.txt] Error 104
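To put the numbers in the error message in perspective, the byte counts can be converted to gigabytes. This is only a minimal sketch; the helper name `bytes_to_gib` is made up for illustration, and the two values are copied from the STAR message above.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Values copied from the STAR error message.
current_limit = 31_000_000_000   # the limitGenomeGenerateRAM value in effect
required = 144_424_593_450       # minimum STAR asks for with this FASTA

print(f"current limit: {bytes_to_gib(current_limit):.1f} GiB")
print(f"required:      {bytes_to_gib(required):.1f} GiB")
```

The requested minimum is several times larger than the limit in effect, so raising the limit only helps on a machine that really has that much memory available.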

I guess this is because STAR needs a lot of RAM when generating the genome files, so I changed var.py to give STAR a larger RAM limit. However, I still get the same error (the message still says limitGenomeGenerateRAM=31000000000is too small for your genome), so the STAR parameter I set in var.py does not seem to take effect:
META = 'meta.csv'
GENOME_FASTA = '../annotation/ref-transcripts.fa'
ANNOTATION = '../annotation/ref-transcripts.gtf'
CPUS = '1'
STAR_PARAMS = ['--limitGenomeGenerateRAM', '160424593450']

Could you please help me with this error? Thanks a lot! Best,

egaffo commented 1 year ago

To build genome indexes with custom parameters, such as STAR's --limitGenomeGenerateRAM, you have to build the genome index(es) in a separate CirComPara2 run (with its own command). Follow the instructions here and add the --limitGenomeGenerateRAM parameter to STAR_EXTRA_PARAMS. For the other options of the genome index generator script, check its help with [path_to_circompara2_home]/src/utils/bash/make_indexes "-h". With the CirComPara2 Docker container, you need to change the default entry point:

docker run -u `id -u` --rm -it -v $(pwd):$(pwd) -w $(pwd) --entrypoint /circompara2/src/utils/bash/make_indexes egaffo/circompara2:v0.1.2.1 '-h'

If you want just the STAR index, the command should look like

docker run -u `id -u` --rm -it -v $(pwd):$(pwd) -w $(pwd) --entrypoint /circompara2/src/utils/bash/make_indexes egaffo/circompara2:v0.1.2.1 'INDEXES="STAR" STAR_EXTRA_PARAMS="--limitGenomeGenerateRAM 160424593450"'

Then, set the precompiled index path as the STAR_INDEX parameter to run CirComPara2.
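Before pointing STAR_INDEX at a precompiled index, it can help to check that the directory looks complete. The following is a rough sketch under stated assumptions: `chrLength.txt` is the file named in the scons error earlier in this thread, the other file names are assumed from typical `STAR --runMode genomeGenerate` output, and both the helper name and the example path are hypothetical.

```python
from pathlib import Path

# Files assumed to be present in a complete STAR index directory.
# chrLength.txt appears in the scons error in this thread; the others are
# an assumption based on typical STAR genomeGenerate output.
EXPECTED_FILES = ("Genome", "SA", "SAindex", "chrLength.txt")


def missing_star_index_files(index_dir: str) -> list[str]:
    """Return the expected index files that are absent from index_dir."""
    base = Path(index_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # Hypothetical path: replace with the directory produced by make_indexes.
    star_index = "dbs/indexes/star"
    missing = missing_star_index_files(star_index)
    if missing:
        print(f"{star_index} looks incomplete, missing: {', '.join(missing)}")
    else:
        # Line to copy into vars.py once the index is in place.
        print(f'STAR_INDEX = "{star_index}"')
```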

lovebaboon1989 commented 1 year ago

Hi Egaffo, thanks for the quick reply. I am using Singularity to run the circompara2 image, and I will try your suggestion to see if it works. I also wonder whether I could simply skip all STAR-related steps and methods by changing some parameter settings in var.py. That would be much easier for me and would save RAM on my computer, thanks! A related question: do all the commented lines in var.py take effect or not? I tried uncommenting the CIRCRNA_METHODS lines and deleting circexplorer2_star, but I still got the same error, which suggests that either the exclusion is not working or STAR is also called somewhere else. Below is the var.py I use when running the test dataset of the circompara2 pipeline:

META = 'meta.csv'
GENOME_FASTA = '../annotation/ref-transcripts.fa'
ANNOTATION = '../annotation/ref-transcripts.gtf'
CPUS = '4'

# pre-computed index and annotation files
GENOME_INDEX = "../indexes/hisat2/CFLAR_HIPK3"
SEGEMEHL_INDEX = "../indexes/segemehl/CFLAR_HIPK3.idx"
BWA_INDEX = "../indexes/bwa/CFLAR_HIPK3"
BOWTIE2_INDEX = "../indexes/bowtie2/CFLAR_HIPK3"
BOWTIE_INDEX = "../indexes/bowtie/CFLAR_HIPK3"
STAR_INDEX = "../indexes/star/CFLAR_HIPK3"
GENEPRED = "../annotation/CFLAR_HIPK3.genePred.wgn"

PREPROCESSOR = 'trimmomatic'
PREPROCESSOR_PARAMS = 'MAXINFO:40:0.5 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:35 AVGQUAL:30'
FIX_READ_HEADER = 'True'
HISAT2_EXTRA_PARAMS = '--rna-strandness RF' # stranded libraries
CE2_PARAMS = ['--no-fix'] #suggested not to set '--no-fix' in real datasets
MIN_READS = 2 #default
CIRCRNA_METHODS = 'dcc,ciri,findcirc,testrealign,'\
                  'circexplorer2_star,circexplorer2_segemehl,'\
                  'circexplorer2_bwa,circexplorer2_tophat,circrna_finder'
#
TOGGLE_TRANSCRIPTOME_RECONSTRUCTION = 'True'

# aligners' custom parameters
# parameters from CIRI
BWA_PARAMS = ['-T', '19', '-c', '1']
# parameters from CIRCexplorer2
SEGEMEHL_PARAMS = ['-M','1'] #'-D', '0', '-Z', '20'
TOPHAT_PARAMS = ['--max-multihits', '1']#'--zpacker','pigz'
# parameters used in DCC manual example
STAR_PARAMS = ['--outFilterMultimapNmax', '1',
               '--outSJfilterOverhangMin', '15', '15', '15', '15',
               '--alignSJoverhangMin', '15',
               '--alignSJDBoverhangMin', '15',
               '--seedSearchStartLmax', '30',
               '--outFilterScoreMin', '1',
               '--outFilterMatchNmin', '1',
               '--outFilterMismatchNmax', '2',
               '--chimSegmentMin', '15',
               '--chimScoreMin', '15',
               '--chimScoreSeparation', '10',
               '--chimJunctionOverhangMin', '15']
LIN_COUNTER = 'ccp' #'dcc'
#
DCC_EXTRA_PARAMS = ['-fg', '-M', '-Nr', 1, 1, '-F', '-ss']
TESTREALIGN_PARAMS = ['-q', 'median_1'] ## suggested 'median_40'
FINDCIRC_EXTRA_PARAMS = ['--best-qual', '0'] #suggested '40'
SAM_SORT_MM = '1G'
BYPASS = 'linear'
CCP_COUNTS = 'True'
#
CIRC_MAPPING = "{'SE':['STAR','TOPHAT','BOWTIE2'],'PE':['BWA','SEGEMEHL']}"

egaffo commented 1 year ago

Commented lines in vars.py are skipped. STAR is used by DCC, CIRCexplorer2 (circexplorer2_star) and circRNA_finder, so you have to remove all three methods to avoid running STAR. You can set
CIRCRNA_METHODS = 'ciri,findcirc,circexplorer2_segemehl,circexplorer2_bwa,circexplorer2_tophat'
However, keep in mind that Segemehl also uses a lot of RAM: it requires about 60GB to load the whole human genome index (STAR needs about 32GB). For a machine with <32GB RAM, you could set
CIRCRNA_METHODS = 'ciri,findcirc,circexplorer2_bwa,circexplorer2_tophat'
I have no experience with Singularity, but it should work similarly to Docker; as far as I know, Docker containers can be converted into Singularity images.
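The selection logic described in this reply can be sketched as a small helper that builds a CIRCRNA_METHODS string from the available RAM. It is an illustration, not part of CirComPara2: the function name is hypothetical, the thresholds are the approximate human-genome figures quoted above, and treating testrealign as Segemehl-based is an assumption.

```python
# Methods that run STAR, as stated in the reply above.
STAR_METHODS = {"dcc", "circexplorer2_star", "circrna_finder"}
# Assumption: these are the Segemehl-based methods in the default list.
SEGEMEHL_METHODS = {"testrealign", "circexplorer2_segemehl"}

# Default method list, copied from the vars.py posted in this thread.
DEFAULT_METHODS = (
    "dcc,ciri,findcirc,testrealign,"
    "circexplorer2_star,circexplorer2_segemehl,"
    "circexplorer2_bwa,circexplorer2_tophat,circrna_finder"
)


def select_methods(methods: str, ram_gb: float) -> str:
    """Drop methods whose aligner is expected not to fit in ram_gb.

    Thresholds are the rough human-genome figures from the reply:
    about 32 GB for STAR and about 60 GB for Segemehl.
    """
    excluded = set()
    if ram_gb < 32:
        excluded |= STAR_METHODS
    if ram_gb < 60:
        excluded |= SEGEMEHL_METHODS
    kept = [m for m in methods.split(",") if m and m not in excluded]
    return ",".join(kept)


if __name__ == "__main__":
    # Line to copy into vars.py for a machine with 16 GB of RAM.
    print(f"CIRCRNA_METHODS = '{select_methods(DEFAULT_METHODS, 16)}'")
```

For a machine below 32 GB this yields the same four methods suggested above.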

lovebaboon1989 commented 1 year ago

Hi Egaffo, thanks for the reply. I took the STAR index that I had generated for a previous RNA-seq pipeline and set it as the pre-computed STAR index in var.py, and this works now! However, I get another error when building the TopHat indexes:
Error: Couldn't build bowtie index with err = 1
scons: *** [samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/tophat_out/accepted_hits.bam] Error 1

Do you know how to solve this error? Thanks!

egaffo commented 1 year ago

I see you are not using the genome FASTA files from Ensembl or UCSC, but perhaps a custom genome ("ref-transcripts.fa") and annotation, which can cause the problem. Check that the files and their formats are consistent. Also, try searching the web for that error.
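One quick consistency check is to compare the sequence names in the FASTA headers with the names in the first column of the GTF. The sketch below assumes plain-text (uncompressed) files, and the function names are made up for illustration; it only reports names and does not validate the formats themselves.

```python
import sys


def fasta_sequence_names(fasta_path):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_sequence_names(gtf_path):
    """Collect the first-column (seqname) values of a GTF, skipping comments."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def gtf_names_missing_from_fasta(fasta_path, gtf_path):
    """Return the sorted GTF sequence names that have no FASTA record."""
    return sorted(gtf_sequence_names(gtf_path) - fasta_sequence_names(fasta_path))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_names.py genome.fa annotation.gtf
    for name in gtf_names_missing_from_fasta(sys.argv[1], sys.argv[2]):
        print(f"in GTF but not in FASTA: {name}")
```

A transcript FASTA paired with a genome annotation would typically show up here as a long list of unmatched names.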

lovebaboon1989 commented 1 year ago

Ahh, I see the difference between my transcripts.fa and the Homo_sapiens.GRCh38.dna.primary_assembly.fa that I downloaded from Ensembl (we had the former because previously we only focused on mRNA expression levels in human). Now everything works well and a reliable circRNA expression matrix is generated, thanks a lot!