etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 164 forks source link

to_chunks splitting in WGS mode #526

Open Stikus opened 4 years ago

Stikus commented 4 years ago

Hello, thx for great tool.

We are trying to process germline WGS samples and encountered that parallelization is not working, After digging into source code we found reason:

"""Split a BED file into chunk_size-line parts for parallelization."""

Default chunk_size is 5000, but in WGS mode access bed contains several hundred lines (440 in our case) and splitting is not working - this causes the processing full genome bam on one thread - executing time is near 30 mins.

After manually editing this line and setting chunk_size to 10 we got multithreading as expected.

Here is my suggestion - either in to_chunks or before calling it count optimal chunk_size by dividing number of lines in .bed to number of processes passed to function.

For now, we will use modified source code.

serge2016 commented 4 years ago

I hope it would be possible to use --processes correctly in such case!

etal commented 4 years ago

That is a good suggestion, thanks!

etal commented 4 years ago

Wait, there should be more than 440 lines in your targets.bed file. For WGS, I think you need to run the BED file of accessible regions through autobin and/or target to subdivide long regions into smaller bins. The batch command ought to do this automatically -- please let me know if it doesn't.

Stikus commented 4 years ago

You are talking about this call: https://github.com/etal/cnvkit/blob/master/cnvlib/batch.py#L89 and this function https://github.com/etal/cnvkit/blob/master/cnvlib/target.py#L9 ?

According to one of our logs

  Command: '/usr/local/bin/cnvkit.py batch /data/dragen1/ESSE_03-1552_K1_S24.bam --seq-method wgs --segment-method cbs --processes 96 --output-dir /output/test2_withRef/cnvkitRunDir --reference /ref/GRCh38.d1.vd1/CNVkit//FlatReference.cnn'.
  PID=3840 (last job)
CNVkit 0.9.7
Created directory /output/test2_withRef/cnvkitRunDir
Wrote /output/test2_withRef/cnvkitRunDir/FlatReference.target-tmp.bed with 438 regions
Wrote /output/test2_withRef/cnvkitRunDir/FlatReference.antitarget-tmp.bed with 0 regions
Running 1 samples in 96 processes (that's 96 processes per bam)
Running the CNVkit pipeline on /data/dragen1/ESSE_03-1552_K1_S24.bam ...
Processing reads in ESSE_03-1552_K1_S24.bam
Time: 2016.766 seconds (391971 reads/sec, 0 bins/sec)

FlatReference.target-tmp.bed contains 438 regions as our FlatReference.cnn. And FlatReference.cnn contains this number of regions due to command creating it:

/usr/local/bin/cnvkit.py reference -o /ref/GRCh38.d1.vd1/CNVkit//FlatReference.cnn -f /ref/GRCh38.d1.vd1/GRCh38.d1.vd1.fa -t /ref/GRCh38.d1.vd1/CNVkit//GRCh38.d1.vd1.bed

Target bed is access bed in this case.

I'll check our scripts and source and return.

Stikus commented 4 years ago

According to source https://github.com/etal/cnvkit/blob/master/cnvlib/commands.py#L62 - when we are using prepared reference.cnn - no subdividing is happened:

  1. We are in this code block: https://github.com/etal/cnvkit/blob/master/cnvlib/commands.py#L116

  2. Here is call of this function

  3. I don't see any splitting after. Looks like to split we shouldn't use prepared reference.cnn

My questions:

  1. Is this expected behavior of batch?

  2. How should we create reference.cnn to avoid this problem? Split our access.bed with target command before creating it? How exactly?

Last question was asked here before:

Does it mean that in wgs mode I should run cnvkit.py target command before generating a *.reference.cnn file as well?