Oshlack / necklace

Combine reference and assembled transcriptomes for RNA-Seq analysis
https://github.com/Oshlack/necklace/wiki
GNU General Public License v3.0

CPU count not passed to HISAT2 #3

Closed acesnik closed 5 years ago

acesnik commented 6 years ago

Hi there,

This is an interesting tool. Thanks for developing it and making it open source! I'm giving it a shot on some data we're interested in (PC3 prostate cancer cell line data).

I'm finding that the number of threads isn't passed to the HISAT2 command, even though your code suggests it should be, so the run is taking quite a while.

AC

Here's the command log for my test run:

####################################################################################################
# Starting pipeline at Fri May 11 23:28:34 GMT 2018
# Input files:  input_template.config
# Output Log:  .bpipe/logs/6891.log
# Stage set_input
# Stage run_check
echo "Running necklace version 1.00" ;      echo "Using 24 threads" ;             echo "Checking for the data files..." ;       for i in /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.81.gff3 /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.81.gff3 /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair1.fastq.gz /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair2.fastq.gz ;                   do ls $i 2>/dev/null || { echo "CAN'T FIND $i..." ;         echo "PLEASE FIX PATH... STOPPING NOW" ; exit 1  ; } ;         done ;             echo "All looking good" ;             echo "running  necklace version 1.00.. checks passed" > checks_passed
# Stage de_novo_assemble (2)
# Stage build_relatives_superTranscriptome (3)
# Stage build_genome_index (1)
/mnt/e/testNecklace/necklace-1.00/tools/bin/Trinity --seqType fq --max_memory 50G --normalize_reads --left /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair1.fastq.gz  --right /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair2.fastq.gz --CPU 22 --full_cleanup ; mv trinity_out_dir.Trinity.fasta de_novo_assembly/de_novo_assembly.fasta
cat /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.81.gff3 > /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/related_species_annotations_combined.gtf ; /mnt/e/testNecklace/necklace-1.00/tools/bin/stringtie --merge  -G /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/related_species_annotations_combined.gtf  -o /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/annotation_related_species.merged.gtf  /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.81.gff3 ;  /mnt/e/testNecklace/necklace-1.00/tools/bin/gtf2flatgtf /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/annotation_related_species.merged.gtf  /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/annotation_related_species.flattened.gtf ; /mnt/e/testNecklace/necklace-1.00/tools/bin/gffread /mnt/e/testNecklace/necklace-1.00/relatives_superTranscriptome/annotation_related_species.flattened.gtf -g /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa -w relatives_superTranscriptome/genome_superT.relative.fasta
/mnt/e/testNecklace/necklace-1.00/tools/bin/hisat2-build /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa genome_guided_assembly/genome
# Stage gtf_to_splice_sites (1)
cat /mnt/e/source/repos/Spritz/CMD/bin/Debug/Homo_sapiens.GRCh38.81.gff3 | /mnt/e/testNecklace/necklace-1.00/tools/bin/hisat2_extract_splice_sites.py - > genome_guided_assembly/splicesites.txt
# Stage map_reads_to_genome (1)
/mnt/e/testNecklace/necklace-1.00/tools/bin/hisat2   --known-splicesite-infile genome_guided_assembly/splicesites.txt  --dta --summary-file genome_guided_assembly/mapped2genome.sum -x genome_guided_assembly/genome  -1 /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair1.fastq.gz -2 /mnt/e/ProjectsActive/Prostate/PC3Alignments/SRX1603582_1-trimmed-pair2.fastq.gz | /mnt/e/testNecklace/necklace-1.00/tools/bin/samtools view -Su - | /mnt/e/testNecklace/necklace-1.00/tools/bin/samtools sort - -o genome_guided_assembly/genome_mapped.bam

acesnik commented 6 years ago

Oh, I see from tracing necklace.groovy that the reason for this is here: https://github.com/Oshlack/necklace/blob/master/necklace.groovy#L65

acesnik commented 6 years ago

Actually, it might be here, too: https://github.com/Oshlack/necklace/blob/master/bpipe_stages/genome_guided_assembly.groovy#L32

nadiadavidson commented 6 years ago

Hi,

Thanks for trying our pipeline!

Necklace runs HISAT2 and Trinity in parallel, giving just one thread to HISAT2 and all the rest to Trinity. This usually makes sense, since in almost all cases HISAT2 will finish long before Trinity does. You could try setting the number of threads for HISAT2 specifically with the option: -p hisat2_options="-p <n>" (see https://github.com/Oshlack/necklace/wiki/Options). In this case, you'd just need to be aware that the maximum number of threads running on your machine is likely to be: <n threads given to Necklace> + <n passed with hisat2 string> - 1
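To make the arithmetic concrete, here is a minimal sketch that reads the formula above as "threads given to Necklace, plus threads passed in the hisat2 string, minus one". The function name `peak_threads` and the example value of 8 HISAT2 threads are illustrative assumptions, not part of Necklace; 24 is the thread count reported in the log above.

```python
def peak_threads(necklace_threads: int, hisat2_threads: int) -> int:
    """Rough upper bound on threads busy at once.

    Assumes the reading of the formula in this thread:
    Necklace threads + HISAT2 threads - 1 (hypothetical helper, not Necklace code).
    """
    if necklace_threads < 1 or hisat2_threads < 1:
        raise ValueError("thread counts must be positive integers")
    return necklace_threads + hisat2_threads - 1


# 24 comes from the log above; 8 is an arbitrary example for hisat2_options="-p 8".
print(peak_threads(24, 8))
```

If that total exceeds the cores actually available, lowering either the Necklace thread count or the HISAT2 value should avoid oversubscribing the machine.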

The mapping and counting steps that happen later in the pipeline (the ones you've highlighted) should use the correct number of threads, I think. The mapping is done for each sample in parallel, so the threads per process are the total divided by the number of samples (roughly), and the counting should use all threads.
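As a rough illustration of that per-sample split, the sketch below divides the total thread count evenly across samples. The integer division and the floor of one thread per sample are assumptions made for the example; the exact rounding Necklace uses is not stated in this thread, and `threads_per_sample` is a hypothetical name.

```python
def threads_per_sample(total_threads: int, n_samples: int) -> int:
    """Approximate threads available to each per-sample mapping process.

    Assumption: an even integer split with at least one thread per sample.
    This is an illustration only, not the rule implemented in Necklace.
    """
    if total_threads < 1 or n_samples < 1:
        raise ValueError("total_threads and n_samples must be positive integers")
    return max(1, total_threads // n_samples)


# Example: 24 threads (as in the log above) shared across 3 hypothetical samples.
print(threads_per_sample(24, 3))
```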

Hope this helps and I might update the documentation at some stage to make it clearer.

Cheers, Nadia.