bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io

Would it be possible to run the QC stage in parallel? #1556

Closed: NeillGibson closed this issue 7 years ago

NeillGibson commented 8 years ago

Hi,

Would it be possible to run the QC stage in parallel?

I like the new (for me) MultiQC report very much, but it takes quite a while to create all the QC files compared to the rest of the analysis stages.

I think the QC stage currently runs sequentially across all samples, while all the other analysis stages run in parallel per sample or per genome region.

Is it intended for the QC stage to run sequentially per sample? This older issue https://github.com/chapmanb/bcbio-nextgen/issues/1175 mentions that, at that time, it was difficult to parallelize stages after variant calling had finished.

For a recent data set of 42 samples that I processed with bcbio 0.9.9 (bwa + freebayes), the QC stage took the longest of all stages.

| Analysis stage | Total wall time (hh:mm) | Total wall time / nr samples (hh:mm) |
| --- | --- | --- |
| prep_align_inputs | 03:33 | 00:05 |
| process_alignment | 26:00 | 00:37 |
| callable regions | 14:00 | 00:20 |
| variant calling | 25:00 | 00:35 |
| postprocess_variants | 03:00 | 00:04 |
| QC | 42:00 | 01:00 |

Running the QC stage fully in parallel would reduce the total analysis time from 113 hours to 72 hours (113 hours minus 42 hours of sequential QC plus roughly 1 hour of parallel QC), a reduction of about 36% in total wall time. The saving would become even more significant for sample sets in the hundreds.

The QC stage splits into the following sub-tasks, run sequentially per sample:

| QC analysis task | Total wall time (hh:mm) |
| --- | --- |
| Downsample BAM | 00:02 |
| FastQC | 00:01 |
| Samtools stats | 00:11 |
| Sample VCF stats | 00:41 |

Another option would be to convert the VCF.gz file to a (temporary?) BCF.gz file using `bcftools view -O b`.

This takes around 2 hours for my 30 GB vcf.gz (running the conversion now), but that would be more than compensated for by cutting the time needed for the sample VCF stats by 50% to 75% (will test tomorrow).

Conversion to BCF.gz is something we would already like to do for every bcbio output, to significantly speed up other bcftools queries that we run downstream. We still need the vcf.gz though, because more tools support it (I am guessing that is the reason bcbio outputs vcf.gz).
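
For reference, a minimal sketch of the conversion step (file names are just examples from this thread; the index is only needed if downstream tools require one):

```bash
# One-off conversion of the bgzipped VCF to binary BCF, reused afterwards for stats and queries.
bcftools view -O b -o my.bcf.gz my.vcf.gz
bcftools index my.bcf.gz
```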

I do like the VCF stats and the MultiQC report, so please don't remove the VCF stats :)

Thank you.

schelhorn commented 8 years ago

I second that request; however, I know that changes in parallelization are often seen as major implementation hurdles, and Brad likes to defer to CWL for these cases ;) The performance issue with bcbio is that everything is staged over samples, i.e. all samples always have to move through a stage in parallel before the next stage can begin. That is very wasteful, since the overall speed is determined by the largest/slowest sample. While this will only change with CWL, perhaps a little tuning of the QC stage could still be implemented for people who cannot switch to CWL until it is stable; for us, more performance directly translates into better research.

NeillGibson commented 8 years ago

I am all for avoiding data processing bottlenecks but understand there might be practical limitations at this time.

Running bcftools stats on binary VCF (= BCF) already reduces the time needed for that step by 83%. The stats output is the same.

Conversion to BCF took 95 minutes in my case, and the time saved per sample is about 36 minutes, so after 3 samples (3 x 36 = 108 minutes saved) this approach starts paying off and keeps saving time from there.

bcftools stats on text VCF:

```
time bcftools stats -s SAMPLE_X -f PASS my.vcf.gz > SAMPLE_X_text_vcf.stats

real    42m38.426s
user    42m1.180s
sys     0m16.380s
```

bcftools stats on binary BCF:

```
time bcftools stats -s SAMPLE_X -f PASS my.bcf.gz > SAMPLE_X_binary_bcf.stats

real    6m39.632s
user    6m28.890s
sys     0m10.461s
```

NeillGibson commented 8 years ago

Another practical option would be to open a ticket with bcftools, requesting a function that can create all the per-sample stats files in one pass over the VCF/BCF file.

This 30 GB file is currently read and parsed 42 times, while the stats for all samples could also be gathered in one go.

chapmanb commented 8 years ago

Neill and Sven-Eric; Thanks for the discussion, definitely happy to try and improve performance here. As of 0.9.9, QC is parallelized by sample and QC method, so each of the individual QC steps can happen in parallel if you have machines to allocate them to. Each of the processes also gets multiple cores, so there could be more internal parallelism where possible. Is this not the behavior you're seeing? If you had fewer individual machines, the parallelization would effectively only be per sample. Maybe, if we're not doing a good job of making use of individual cores, we could improve QC speeds by moving to the single-core cluster setup used for variant calling?

To Sven-Eric's point, the ongoing CWL implementation does a much better job of this. Individual QC tasks can happen at the same time as other processes in the pipeline and get resource allocations based on their needs. So better ways of parallelizing are in progress.

Independent of all that, it sounds like the practical thing we could do is try to speed up the coverage and variant metrics, which take up 2/3 of the time. @lpantano do you have thoughts on good ways to do this? We could definitely explore making BCF files or doing fewer passes, or also try to pre-calculate coverage and other metrics. It sounds like an optimization here would help a lot and is needed independently of how we end up parallelizing (for additional gains). Does that sound like a reasonable place to start?

NeillGibson commented 8 years ago

Hi @chapmanb, thank you for the answer. I parsed the 42 hours of wall time for QC from the log file of bcbio v0.9.9.

So either I made a mistake in parsing/understanding the log file, or my v0.9.9 installation somehow did not do the QC in parallel. I will double-check this later this week by looking at the log again and looking up the exact installed version (which commit).

Good to know, at least, that there should be parallelism for the QC in bcbio v0.9.9! :)

If I can get my bcbio 0.9.9 installation to do the QC in parallel, then the other points are more nice-to-haves. I mentioned them mainly for the case that QC could not be run in parallel.

NeillGibson commented 8 years ago

I double-checked my log, and QC really took 42 hours running sequentially on 1 machine, while I specified enough CPUs for mapping and variant calling to run in 6 jobs of 10 CPUs over multiple machines.

I started a new analysis for just two samples with 45 CPUs, and again the bcbio log files indicate that QC ran sequentially on just one machine.
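
For context, these runs use bcbio's IPython-parallel setup, which is launched roughly along these lines (a sketch only; the scheduler, queue and total core count shown are placeholders, not the exact command from my logs):

```bash
# Distributed run: request 60 total cores from the scheduler via IPython parallel;
# bcbio then splits these into jobs based on the per-program core settings (10 here).
bcbio_nextgen.py ../config/project.yaml -t ipython -n 60 -s sge -q all.q
```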

The bcbio version I am running is 0.9.9. I am not sure how to get the exact commit.

```
bcbio_nextgen.py -v
0.9.9
```

So somehow the QC stage does not run in parallel for me. Is there a way that I can troubleshoot this?

The only custom thing about my installation, as far as I know, is that I set the default number of cores to 10 in bcbio_system.yaml (our machines have multiples of 10 CPUs, not the original default of 16).

lpantano commented 8 years ago

Hi @NeillGibson

Can I get the debug log file? I would like to look into it further and see whether the problem is in the resource calculation rather than in the QC parallelization.

Sorry about this, I understand it is painful.

NeillGibson commented 8 years ago

Hi @lpantano. I sent you an email at your iscb.org address with the bcbio debug logs for the large and the small analysis.

lpantano commented 8 years ago

Thanks.

Can you check whether any of the commands for the QC step were using 10 cores? I am seeing that, for the small test, bcbio configures only 1 job with 10 cores for the last chunk of the analysis, which includes many steps, QC among them. The same actually happens with the big job (see below).

I don't know why, with 42 samples, it only creates 1 job with 10 cores instead of 4 jobs with 10 cores. I think I need @chapmanb's help here, because it seems to be a problem in the resource calculation somehow, or the number of jobs for the previous steps is sometimes not quite compatible with the resources of the QC steps.

```
[2016-09-05T07:56Z] grid_master: Configuring 39 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-09-05T07:57Z] grid_master: Timing: alignment post-processing
[2016-09-05T07:57Z] grid_master: ipython: piped_bamprep
[2016-09-05T08:02Z] grid_master: Timing: variant calling
[2016-09-05T08:02Z] grid_master: ipython: variantcall_sample
[2016-09-05T08:03Z] grid_master: ipython: concat_variant_files
[2016-09-05T08:05Z] gridn20: Resource requests: ; memory: 1.00; cores: 1
[2016-09-05T08:05Z] gridn20: Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2016-09-05T12:27Z] grid_master: Resource requests: bcbio_variation, fastqc, gatk, gemini, qsignature, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00; cores: 10, 10, 1, 10, 10, 10
[2016-09-05T12:27Z] grid_master: Configuring 1 jobs to run, using 10 cores each with 35.1g of memory reserved for each job
[2016-09-05T12:28Z] grid_master: Timing: joint squaring off/backfilling
[2016-09-05T12:28Z] grid_master: ipython: square_batch_region
[2016-09-05T12:28Z] grid_master: ipython: concat_variant_files
[2016-09-05T12:28Z] grid_master: Timing: variant post-processing
[2016-09-05T12:28Z] grid_master: ipython: postprocess_variants
[2016-09-05T15:35Z] grid_master: ipython: split_variants_by_sample
[2016-09-05T15:35Z] grid_master: Timing: prepped BAM merging
[2016-09-05T15:35Z] grid_master: Timing: validation
[2016-09-05T15:35Z] grid_master: ipython: compare_to_rm
[2016-09-05T15:35Z] grid_master: Timing: ensemble calling
[2016-09-05T15:35Z] grid_master: Timing: validation summary
[2016-09-05T15:35Z] grid_master: Timing: structural variation precall
[2016-09-05T15:35Z] grid_master: ipython: detect_sv
[2016-09-05T15:35Z] grid_master: Timing: structural variation
[2016-09-05T15:35Z] grid_master: ipython: detect_sv
[2016-09-05T15:35Z] grid_master: Timing: structural variation ensemble
[2016-09-05T15:35Z] grid_master: ipython: detect_sv
[2016-09-05T15:35Z] grid_master: Timing: structural variation validation
[2016-09-05T15:35Z] grid_master: ipython: validate_sv
[2016-09-05T15:35Z] grid_master: Timing: heterogeneity
[2016-09-05T15:35Z] grid_master: ipython: heterogeneity_estimate
[2016-09-05T15:35Z] grid_master: Timing: population database
[2016-09-05T15:35Z] grid_master: ipython: prep_gemini_db
[2016-09-05T17:50Z] grid_master: Timing: quality control
[2016-09-05T17:50Z] grid_master: ipython: pipeline_summary
```


NeillGibson commented 8 years ago

Subsampling ran with 10 CPUs for the small analysis; see the `sambamba view -t 10` part of the command.

```
[2016-09-20T02:39Z] gridn19: /data/prod/Tools/bcbio-0.9.9/anaconda/bin/sambamba view -t 10
-f bam -o /gpfs/gpfs1/data/run/Projects//DA-697/DA_697_samples-merged/work/qc/SXS_102575_03/tx/tmpolWdiy/SXS_102575_03-sort-downsample.bam
--subsample=0.223 --subsampling-seed=42 /gpfs/gpfs1/data/run/Projects//DA-697/DA_697_samples-merged/work/align/SXS_102575_03/SXS_102575_03-sort.bam
```

NeillGibson commented 8 years ago

Subsampling also ran with 10 CPUs for the large analysis (I just checked that command log as well).

NeillGibson commented 8 years ago

By the way, bcftools has an option to collect the stats for all samples at once: `-s, --samples <list>   list of samples for sample stats, "-" to include all samples`

I tried `bcftools stats -s -`.

The output file then contains a section with the following header, with one line per sample.

```
# PSC, Per-sample counts
# PSC   [2]id   [3]sample       [4]nRefHom      [5]nNonRefHom   [6]nHets        [7]nTransitions [8]nTransversions       [9]nIndels      [10]average depth       [11]nSingletons
```
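
A sketch of how this could replace the per-sample runs, assuming the combined stats are computed once and the PSC section is then split out per sample (file names are placeholders):

```bash
# One pass over the (binary) BCF, collecting stats for all samples at once.
bcftools stats -s - -f PASS my.bcf.gz > all_samples.stats

# Split the per-sample counts (PSC section) into one file per sample; column 3 is the sample name.
awk -F'\t' '$1 == "PSC" { print > ("psc_" $3 ".txt") }' all_samples.stats
```
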
chapmanb commented 8 years ago

Lorena and Neill; Thanks so much for debugging and all the details. I dug into this and realized that the logic did not do a great job of providing cores when there weren't multiple structural variant callers. I fixed the logic, so it should now be more aggressive about core usage and correctly parallelize by sample and QC method over more machines. If you update to the latest development version and re-run, let us know how it looks for speed. Happy to try and do more optimizations beyond this if that will help. Thanks again for the pointers and suggestions.

NeillGibson commented 8 years ago

Thank you for fixing this parallelization issue.
I will try to test it at the end of next week, but it looks like it should do the job.

NeillGibson commented 8 years ago

Thank you! QC now runs in parallel for me with the bwa + freebayes workflow. I set the max number of CPUs to 45, and 4 jobs of 10 CPUs are created and executed for QC.

```
grid-master: Resource requests: bcbio_variation, fastqc, gatk, gemini, qsignature, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00; cores: 10, 10, 1, 10, 10, 10
grid-master: Configuring 4 jobs to run, using 10 cores each with 35.1g of memory reserved for each job
```

lpantano commented 7 years ago

Hi @NeillGibson

I will close this, but feel free to open a new issue if you still run into any problems.

Thanks.