Open Stikus opened 4 years ago
I hope it would be possible to use --processes
correctly in such case!
That is a good suggestion, thanks!
Wait, there should be more than 440 lines in your targets.bed file. For WGS, I think you need to run the BED file of accessible regions through autobin
and/or target
to subdivide long regions into smaller bins. The batch
command ought to do this automatically -- please let me know if it doesn't.
You are talking about this call: https://github.com/etal/cnvkit/blob/master/cnvlib/batch.py#L89 and this function https://github.com/etal/cnvkit/blob/master/cnvlib/target.py#L9 ?
According to one of our logs
Command: '/usr/local/bin/cnvkit.py batch /data/dragen1/ESSE_03-1552_K1_S24.bam --seq-method wgs --segment-method cbs --processes 96 --output-dir /output/test2_withRef/cnvkitRunDir --reference /ref/GRCh38.d1.vd1/CNVkit//FlatReference.cnn'.
PID=3840 (last job)
CNVkit 0.9.7
Created directory /output/test2_withRef/cnvkitRunDir
Wrote /output/test2_withRef/cnvkitRunDir/FlatReference.target-tmp.bed with 438 regions
Wrote /output/test2_withRef/cnvkitRunDir/FlatReference.antitarget-tmp.bed with 0 regions
Running 1 samples in 96 processes (that's 96 processes per bam)
Running the CNVkit pipeline on /data/dragen1/ESSE_03-1552_K1_S24.bam ...
Processing reads in ESSE_03-1552_K1_S24.bam
Time: 2016.766 seconds (391971 reads/sec, 0 bins/sec)
FlatReference.target-tmp.bed
contains 438 regions as our FlatReference.cnn
.
And FlatReference.cnn
contains this number of regions due to command creating it:
/usr/local/bin/cnvkit.py reference -o /ref/GRCh38.d1.vd1/CNVkit//FlatReference.cnn -f /ref/GRCh38.d1.vd1/GRCh38.d1.vd1.fa -t /ref/GRCh38.d1.vd1/CNVkit//GRCh38.d1.vd1.bed
Target bed
is access bed
in this case.
I'll check our scripts and source and return.
According to source https://github.com/etal/cnvkit/blob/master/cnvlib/commands.py#L62 - when we are using prepared reference.cnn
- no subdividing is happened:
We are in this code block: https://github.com/etal/cnvkit/blob/master/cnvlib/commands.py#L116
I don't see any splitting after. Looks like to split we shouldn't use prepared reference.cnn
My questions:
Is this expected behavior of batch
?
How should we create reference.cnn
to avoid this problem? Split our access.bed
with target
command before creating it? How exactly?
Last question was asked here before:
Does it mean that in wgs mode I should run cnvkit.py
target
command before generating a*.reference.cnn
file as well?
Hello, thx for great tool.
We are trying to process germline WGS samples and encountered that parallelization is not working, After digging into source code we found reason:
Default
chunk_size
is 5000, but in WGS mode access bed contains several hundred lines (440 in our case) and splitting is not working - this causes the processing full genome bam on one thread - executing time is near 30 mins.After manually editing this line and setting
chunk_size
to 10 we got multithreading as expected.Here is my suggestion - either in
to_chunks
or before calling it count optimalchunk_size
by dividing number of lines in.bed
to number ofprocesses
passed to function.For now, we will use modified source code.