bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

GATK 2.7 #117

Closed tanglingfung closed 11 years ago

tanglingfung commented 11 years ago

Sorry, I cannot find the specific code for supporting GATK 2.7. Can you tell me where it is? I want to add support for --emitRefConfidence to HaplotypeCaller; please advise.

chapmanb commented 11 years ago

Paul; I try to avoid blocks of version specific code like this, so there isn't a specific place to look for 2.7 tweaks. The approach is to check based on version and apply parameters as needed. Here's an example in the filtering code:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/genotype.py#L269

For adding in the new reference confidence calls, you'd want to add a check here:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/genotype.py#L97

(version >= 2.7 and len(align_bams) == 1)
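A minimal sketch of that kind of version-gated parameter check, in case it helps. This is not the actual bcbio-nextgen code: the function names and the `mode` argument are hypothetical, and the value to pass to --emitRefConfidence should be taken from the GATK documentation for your release. Versions are compared as integer tuples rather than floats so that a version like 2.10 sorts after 2.7.

```python
import re


def version_tuple(version):
    """Turn a GATK version string such as '2.7-2-g6bda569' into (2, 7)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("Unrecognized version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def ref_confidence_params(gatk_version, align_bams, mode):
    """Return extra HaplotypeCaller arguments, or an empty list.

    Hypothetical helper: only adds --emitRefConfidence for GATK >= 2.7
    when a single BAM file is being called, mirroring the check
    (version >= 2.7 and len(align_bams) == 1).
    """
    if version_tuple(gatk_version) >= (2, 7) and len(align_bams) == 1:
        return ["--emitRefConfidence", mode]
    return []
```

The caller would append the returned list to the existing HaplotypeCaller parameters, so older GATK versions and multi-sample runs are left unchanged.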

Thanks for looking at this.

choishingwan commented 11 years ago

I tried to link GATK 2.7 with bcbio-nextgen by changing the soft link under /share/java to point to the latest GATK installation on my server (removing the original gatk link and creating a new one with ln -s -T gatk). However, I encountered a long error, the last fragment of which is the following:

/storage/sam/anaconda/lib/python2.7/site-packages/bcbio/provenance/programs.pyc in get_version(name='picard', dirs=None, config={'algorithm': {'aligner': 'bwa', 'coverage_depth': 'high', 'coverage_interval': 'genome', 'mark_duplicates': 'picard', 'max_errors': 2, 'memory_adjust': {'direction': 'decrease', 'magnitude': 2}, 'num_cores': 1, 'platform': 'illumina', 'quality_format': 'Standard', 'realign': 'gatk', ...}, 'custom_algorithms': {'Minimal': {'aligner': ''}, 'RNA-seq': {'aligner': 'tophat', 'transcript_assemble': True}, 'variant2': {'aligner': 'bwa', 'coverage_depth': 'high', 'coverage_interval': 'exome', 'recalibrate': 'gatk', 'variantcaller': 'gatk'}}, 'log_dir': '/storage/sam/Data/log', 'resources': {'bcbio_variation': {'dir': '/storage/sam/share/java/bcbio_variation', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'bwa': {'cmd': 'bwa', 'cores': 16}, 'freebayes': {'memory': '2g'}, 'gatk': {'dir': '/storage/sam/share/java/gatk', 'jvm_opts': ['-Xms750m', '-Xmx2500m']}, 'gatk-haplotype': {'jvm_opts': ['-Xms2g', '-Xmx5500m']}, 'gatk-vqsr': {'jvm_opts': ['-Xms2g', '-Xmx4000m']}, 'gemini': {'cores': 16}, 'log': {'dir': 'log'}, 'novoalign': {'cores': 16, 'memory': '2G'}, 'picard': {'dir': '/storage/sam/share/java/picard'}, ...}})
    137         p = _get_program_file(dirs)
    138     else:
    139         p = config["resources"]["program_versions"]
    140     with open(p) as in_handle:
    141         for line in in_handle:
--> 142             prog, version = line.rstrip().split(",")
    143             if prog == name and version:
    144                 return version
    145     raise KeyError("Version information not found for %s in %s" % (name, p))
    146

ValueError: need more than 1 value to unpack


Did I do something wrong with my GATK installation, or does bcbio not support the latest GATK? (I have also checked that java -version reports 1.7, as required by GATK...) Thank you for your help.

chapmanb commented 11 years ago

Thanks much for the report. The symlink approach you described should work, but it seems like something is wrong with your program version file. Could you post a Gist (https://gist.github.com/) of your provenance/programs.txt in the working directory?

Also, running bcbio_nextgen.py with a single core (-n 1) during testing will produce less verbose error output and might make the issue easier to spot.
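For anyone hitting the same ValueError: the unpacking in the traceback fails as soon as a line in the version file does not contain a comma. The sketch below shows a way to find such lines; it is not the bcbio-nextgen implementation, the function name is hypothetical, and the sample lines in the usage note are made up for illustration.

```python
def split_program_versions(lines):
    """Separate well-formed 'program,version' lines from anything else.

    Returns a (versions, malformed) pair: a dict mapping program name to
    version string, and a list of lines that do not have exactly two
    comma-separated fields and would break a plain two-value unpack.
    """
    versions = {}
    malformed = []
    for line in lines:
        parts = line.rstrip().split(",")
        if len(parts) != 2:
            malformed.append(line.rstrip())
            continue
        prog, version = parts
        versions[prog] = version
    return versions, malformed
```

Calling it on the lines of provenance/programs.txt and printing the second return value shows which entries are unexpected, for example an error message written where a version string should be.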

choishingwan commented 11 years ago

Hi, I have made the Gist:

https://gist.github.com/choishingwan/6886050

(Sorry about the format, I am new to Gist...)

When inspecting the content, I vaguely remember seeing a similar error when using GATK Queue, with the message "Could not find the main class: org.broadinstitute.sting.gatk.CommandLineGATK". However, when using GATK 2.6, I did not observe this problem; I could even finish a full Queue run based on the GATK best practices.

Thank you for your help

chapmanb commented 11 years ago

The error comes from using a pre-1.7 Java version running GATK 2.7 (or 2.6):

http://stackoverflow.com/questions/10382929/unsupported-major-minor-version-51-0

I know you mention installing 1.7 but it looks like the pipeline is picking up 1.6 instead which causes the issue. I added a check to bcbio-nextgen which should make the origin of the java used more clear in case it is a PATH issue. You can upgrade with bcbio_nextgen.py upgrade -u development or double check the current version on your PATH. Hope this helps.
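As a rough way to double check which Java is picked up from PATH, here is a small sketch. It is not the check added to bcbio-nextgen; the function names are hypothetical. It relies on `java -version` printing a line such as java version "1.7.0_25" (normally on stderr), and on class file version 51.0 corresponding to Java 7, which is why a 1.6 runtime reports "Unsupported major.minor version 51.0".

```python
import re
import shutil
import subprocess


def parse_java_feature_version(version_output):
    """Extract the feature version (6, 7, 8, 11, ...) from `java -version` text.

    Handles both the old '1.x' scheme (1.7.0_25 -> 7) and newer
    releases (11.0.2 -> 11). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def describe_java_on_path():
    """Return (path, feature_version) for the first `java` found on PATH."""
    path = shutil.which("java")
    if path is None:
        return None, None
    result = subprocess.run([path, "-version"], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    # `java -version` writes to stderr; stdout is included just in case.
    return path, parse_java_feature_version(result.stderr + result.stdout)
```

If `describe_java_on_path()` reports a path like /usr/bin/java with a feature version below 7, the pipeline is most likely using the older runtime rather than the newly installed one.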

choishingwan commented 11 years ago

I see. I have now solved the problem with some googling.

What happened on our server is that the original Java (1.6) was installed at /usr/bin/java, whereas Java 1.7 was installed at /software/java-7/jre1.7.0_25/bin/java. To use bcbio_nextgen with the latest GATK, we need to set JAVA_HOME and PATH to point at the new Java location; that allows us to run the pipeline without the error message.
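For reference, the environment change can be sketched as below when launching the pipeline from a Python wrapper. This is only an illustration: the helper name is hypothetical, and the JAVA_HOME value is assumed to be the directory above bin/java in the path mentioned above.

```python
import os


def env_with_java(java_home, base_env=None):
    """Return a copy of the environment that prefers the given Java install.

    Sets JAVA_HOME and puts <java_home>/bin at the front of PATH so it is
    found before any older java such as /usr/bin/java.
    """
    env = dict(os.environ if base_env is None else base_env)
    java_bin = os.path.join(java_home, "bin")
    env["JAVA_HOME"] = java_home
    old_path = env.get("PATH", "")
    env["PATH"] = java_bin + (os.pathsep + old_path if old_path else "")
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.call` when starting bcbio_nextgen.py; setting the same two variables in the shell profile has the same effect.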

Thank you.