falcon-computing / falcon-genome

FCS accelerated version of GATK Best practice in DNA sequencing

Add argument '-L' to GATK stages and support dynamic interval partitions #25

Open allwu opened 8 years ago

allwu commented 8 years ago

GATK stages such as BaseRecalibrator accept the input argument -L, which specifies the regions of interest for analysis. This is particularly useful for exome analysis, since the data is very sparse. Without it, the recalibration for exome samples may be inaccurate, because the model accounts for many unrelated regions.

The -L option can be a file or a string specifying one or more intervals in the format chromosome:position (for example, chr1:100-200).

One issue with supporting this is that it may affect our automatic parallelization, which is based on chromosomes. We may need to parse the intervals in the user input and do the separation ourselves. For example, the easiest way could be:

intv.list -->
chr1: 100-200
chr2: 400-6600

intv-1.list -->
chr1: 100-200

intv-2.list -->
chr2: 400-6600
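A minimal sketch of the split above, in Python purely for illustration. The function names `split_by_contig` and `write_partitions`, the `intv-N.list` naming, and the tolerance for a space after the colon are all assumptions, not part of the existing code base. It groups intervals by chromosome in the order each chromosome first appears, and it does not handle contig names that themselves contain a colon.

```python
import re
from pathlib import Path

# Matches "chr1:100-200" and also tolerates "chr1: 100-200" (space after the colon).
# A single position such as "chr1:100" is treated as the interval 100-100.
_INTERVAL_RE = re.compile(
    r"^(?P<contig>[^:\s]+)\s*:\s*(?P<start>\d+)(?:\s*-\s*(?P<end>\d+))?$"
)


def split_by_contig(lines):
    """Group interval strings by contig, preserving first-seen contig order."""
    groups = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        m = _INTERVAL_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized interval: {line!r}")
        start = int(m["start"])
        end = int(m["end"]) if m["end"] else start
        if end < start:
            raise ValueError(f"interval end before start: {line!r}")
        # Normalize to the compact contig:start-end form.
        groups.setdefault(m["contig"], []).append(f"{m['contig']}:{start}-{end}")
    return groups


def write_partitions(groups, out_dir, prefix="intv"):
    """Write one <prefix>-<n>.list file per contig and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, intervals in enumerate(groups.values(), start=1):
        path = out_dir / f"{prefix}-{i}.list"
        path.write_text("\n".join(intervals) + "\n")
        paths.append(path)
    return paths
```

Each partition file could then be passed as the -L value to the worker that handles that chromosome. A chromosome with no intervals produces no file, so the scheduler would need to skip it.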

The first step is to enable this for BQSR, since the scatter step there is done automatically by GATK Queue. We can then see how to enable it in the later stages.

yaohuFCS commented 8 years ago

When I add the -L argument in the BQSR stage, I get an error like the one below:

ERROR stack trace

org.broadinstitute.gatk.utils.commandline.InvalidArgumentException: Argument with name 'L' isn't defined.

On the other stages it works fine.

allwu commented 7 years ago

Fixing in PR #55.