input arguments and run time

litaifang commented 10 years ago

Hi,

I think I've gotten VarDict to run the past few days, but there are a few questions I don't really understand.

1) When I tried to run VarDict on WGS bam files without supplying region information or a bed file, the program waits for input on stdin, hangs there and does nothing. Is it looking for a bed file? Is a bed file required to run?

2) If I specify a whole chromosome in the command line, it seems the program tries to read everything into the memory, and then it gets killed (probably due to too much memory request). Is this expected behavior?

3) When I specify a region, at 1000 or 10,000 bp interval for each line, it runs okay. I ran it on a pair of tumor/normal chromosome 22 (about 800MB each), and it took 6-7 hours to complete. Is that more or less expected run time?

4) When I specify successive regions in the bed file, should I indicate overlapping regions (i.e., 1-5000 in line 1, and 4750-9750 in line 2)?

4/a) Can you elaborate a bit about the bed files you are using internally as the region?

Thank you very much.

-- Li Tai

mjafin commented 10 years ago

Hi Li, I'm not the author of VarDict so hopefully Zhongwu will chime in too, but here goes:

1) It is required to use a bed file with VarDict. On the other hand, VarDict does not parallelise intrinsically, so processing a whole genome would take weeks. Therefore you will want to split your analysis into very small blocks by the use of bed files.
2) (No comment)
3) Sounds normal to me based on my experience.
4) We often use the manufacturers' exome or targeted capture bed files, and for whole genomes we estimate the callable regions using bedtools genomecov, mimicking the settings of the GATK callable regions tool. The analysis is then split into subregions and parallelised.
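
A minimal sketch of the "split into subregions" step, assuming a plain-text bed file with one region per line. The function name `split_bed_lines` and the round-robin strategy are illustrative assumptions, not the script used in our pipeline or anything shipped with VarDict:

```python
def split_bed_lines(lines, n_chunks):
    """Distribute bed regions round-robin into at most n_chunks lists.

    Hypothetical helper for illustration only. Blank lines, comments and
    'track'/'browser' header lines are skipped.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    kept = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        chunks[kept % n_chunks].append(stripped)
        kept += 1
    # Drop empty chunks when there are fewer regions than requested chunks.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    regions = [
        "chr22\t0\t5000",
        "chr22\t4850\t9850",
        "chr22\t9700\t14700",
    ]
    for i, chunk in enumerate(split_bed_lines(regions, 2)):
        print(f"chunk {i}: {len(chunk)} region(s)")
```

Each resulting list can be written to its own bed file and handed to a separate VarDict job. Round-robin keeps the chunks roughly equal in region count, though not necessarily in read depth.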

zhongwulai commented 10 years ago

Hi Li,

Thanks for the feedback.

1) Miika's right, a BED file is required. To parallelize, you need to split the BED file into smaller ones. I have a utility script for splitting BED files and I'll commit it to GitHub soon.
2) VarDict's memory use grows linearly with the size of each individual segment in the BED file, so a whole chromosome will require too much memory and will likely get killed.
3) In a single process, that's about what's expected. When I tested on the DREAM challenge paired WGS (~350GB total) using 150bp overlapping 5kb segments, it took < 4hr on our cluster of ~30 nodes.
4) As I mentioned, for WGS, 150bp overlapping segments are recommended. I typically use 5kb segments to limit memory for individual jobs, but I've tested using 1-10Mb segments and it still ran fine, though it required more memory. The 150bp overlap lets VarDict call indels that span two segments when only soft-clipped reads support them. For targeted sequencing or exome, use manufacturer-supplied BED files. For exome, option "-x 100" is recommended if you want to call variants that are not in the BED file but might still be hybrid-captured, as we found some critical CDS were missing from the SureSelect BED files, so valid, good-quality mutations were being missed.
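
To make the segment layout concrete, here is a rough sketch of how 5kb segments with a 150bp overlap could be generated for one chromosome. It is not the utility script mentioned above. The function name `overlapping_segments` is hypothetical, the 5kb is assumed to be the full segment length including the overlap, and coordinates follow the standard BED convention (0-based start, exclusive end); whether your VarDict invocation reads them that way is worth checking against its documentation.

```python
def overlapping_segments(chrom, chrom_length, segment_size=5000, overlap=150):
    """Yield (chrom, start, end) tuples covering [0, chrom_length).

    Illustrative helper, not part of VarDict. Consecutive segments share
    `overlap` bases; the last segment is truncated at the chromosome end.
    """
    if segment_size <= 0 or overlap < 0 or overlap >= segment_size:
        raise ValueError("need 0 <= overlap < segment_size")
    step = segment_size - overlap
    start = 0
    while start < chrom_length:
        end = min(start + segment_size, chrom_length)
        yield chrom, start, end
        if end == chrom_length:
            break
        start += step


if __name__ == "__main__":
    # Toy chromosome length chosen for readability, not a real chr22 size.
    for chrom, start, end in overlapping_segments("chr22", 12000):
        print(f"{chrom}\t{start}\t{end}")
```

For a 12,000bp toy chromosome this yields three segments starting at 0, 4850 and 9700, so each neighbouring pair shares exactly 150 bases.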

Thanks again for using VarDict.

-Zhongwu

litaifang commented 10 years ago

Thanks for the explanation. I'm trying VarDict on some data sets, and I'll provide feedback when they're done.

litaifang commented 10 years ago

Hi Zhongwu, Do you mean 30 cores or 30 nodes?

AstraZeneca-NGS / VarDict
