genome / gms

The Genome Modeling System installer
https://github.com/genome/gms/wiki
GNU Lesser General Public License v3.0

Somatic variation fails because of inability to find GATK #47

Closed. malachig closed this issue 9 years ago.

malachig commented 10 years ago

The first test of somatic-variation hit an error after about 1 min. The top of the error log looks like this:

2013-11-27 16:41:10-0600 clia1: Executing detect variants step
2013-11-27 16:41:14-0600 clia1: 2013/11/27 16:41:14 Genome::Sys: Failed to find jar GenomeAnalysisTK at version 2.4
2013-11-27 16:41:14-0600 clia1: ERROR: Failed to find jar GenomeAnalysisTK at version 2.4
2013-11-27 16:41:14-0600 clia1: 2013/11/27 16:41:14 Genome::Model::Tools::DetectVariants2::Strategy id((gatk-somatic-indel 5336 filtered by false-indel v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]) unique union (pindel 0.5 filtered by pindel-somatic-calls v1 then pindel-vaf-filter v1 [--variant-freq-cutoff=0.08] then pindel-read-support v1) unique union (varscan-somatic 2.2.6 filtered by varscan-high-confidence-indel v1 then false-indel v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]) unique union (strelka 0.4.6.2 [isSkipDepthFilters = 1])): Could not call has_version on the class Genome::Model::Tools::DetectVariants2::GatkSomaticIndel
2013-11-27 16:41:14-0600 clia1: ERROR: Could not call has_version on the class Genome::Model::Tools::DetectVariants2::GatkSomaticIndel
2013-11-27 16:41:14-0600 clia1: Command module died or returned undef.

We need to determine how software versions are found in this part of DV2 (DetectVariants2)...

gatoravi commented 10 years ago

Genome/Sys.pm tries to find GATK version 2.4 using the environment variable $ENV{GENOME_JAR_PATH} (line 385 of that file):

my @dirs = split(':', $ENV{GENOME_JAR_PATH});

This variable is set to /usr/share/java on the standalone and within TGI.

Inside TGI, /usr/share/java has a symlink GenomeAnalysisTK.jar -> GenomeAnalysisTK-2.4.jar and the jar file GenomeAnalysisTK-2.4.jar. On the standalone, both the symlink and the jar file are missing.

It looks like this directory has to be replicated on the standalone install. Is it just a matter of pointing the environment variable to a different folder where these jars already exist, or does the whole directory need to be replicated?

Quite a few things that are present in this directory on the TGI end seem to be missing on the standalone install (picardtools, weka, etc.).
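A minimal sketch of the lookup described above, for checking a standalone box by hand. It assumes jars are named <name>-<version>.jar, as in the TGI listing; find_jar is a hypothetical helper and not part of Genome::Sys, whose actual logic may differ.

```shell
# Sketch: search a colon-separated jar path for <name>-<version>.jar.
# Assumption: the <name>-<version>.jar naming is taken from the TGI listing.
# find_jar is a hypothetical helper, not the Genome::Sys implementation.
# Usage: find_jar NAME VERSION [JAR_PATH]   (JAR_PATH defaults to $GENOME_JAR_PATH)
find_jar() (
    name=$1
    version=$2
    jar_path=${3:-${GENOME_JAR_PATH:-}}
    IFS=:      # split the path on ':' like the Perl split() call
    set -f     # do not glob-expand directory names while splitting
    for dir in $jar_path; do
        [ -n "$dir" ] || continue
        candidate="$dir/$name-$version.jar"
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            exit 0
        fi
    done
    echo "Failed to find jar $name at version $version" >&2
    exit 1
)
```

Running `find_jar GenomeAnalysisTK 2.4` on the standalone install should fail in the same way as the log above for as long as the jar is absent from every directory in GENOME_JAR_PATH.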

malachig commented 10 years ago

Note from Scott on this issue:

Most of those envs that are set to a global network path at TGI are set to something inside the sw directory for the standalone box. There should be some tgz with some Java stuff next to the apps*.tgz.

The java tgz may not have the latest stuff. If so, a new java tgz needs to be made, with the added stuff and a different date, and the makefile should be updated to download it too.

Some of the Java stuff has been packaged as debs. I think Allison did this for GATK. If so, a fresh genome-snapshot-deps package should solve the problem. The best people to talk with about the state of that are Matt Callaway and Nathan Nutter. Matt was trying to take it past the state in which I left it. There were directories for Ubuntu Lucid and Precise, and the Precise directory should have equivalents of the same packages. Where this was not possible, the files listing those deps are broken out into a *.missing file.

malachig commented 10 years ago

Currently the only things in 'java-2013-08-27.tgz' are: rdp-classifier, samtools, VarScan*, and weka.jar

As far as I can tell, the only environment variable related to Java that is defined globally is: GENOME_SW_LEGACY_JAVA=/opt/gms/4K8W670/sw/java

GENOME_JAR_PATH appears to be specified in: Genome/Env/GENOME_JAR_PATH.pm

This does seem to work in the standalone GMS:

% perl -e 'use Genome; $test=$ENV{GENOME_JAR_PATH}; print "\n$test\n"'

/usr/share/java

Even if we add GATK to the JAVA archive, it is not obvious to me from the makefile how the contents are meant to be found... If the system looks for them in '/usr/share/java' ... I see no active attempt to place them there. Perhaps that is only where properly packaged JAVA stuff goes?

malachig commented 10 years ago

To see what is currently in genome-snapshot-deps within TGI, go to the top level of a 'genome' checkout and run:

% cd /gscuser/mgriffit/git/genome/
% git submodule update --init genome-snapshot-deps
% cd genome-snapshot-deps/precise/
% grep -i GATK *

All I see is this:

genome-snapshot-deps-apps-external.depends.missing:libgatk-protected-java (>= 2.4-1)

Debian packages get into the standalone GMS something like this:

malachig commented 10 years ago

It looks like 'libgatk-protected-java' is marked as 'missing' in the Precise (Ubuntu 12.04) version of genome-snapshot-deps but as 'depends' in the Lucid (Ubuntu 10.04) version.

One thing we could try is to install it directly in the GMS. If that works, we could take it out of the missing list for Precise, put it in the regular list, and rebuild the apt repo for the standalone box.

Another option is to use one of the 8 versions of GATK that are currently in the 'apps' repo from /gsc/pkg/bio in the TGI.

These seem to be very old versions of GATK, though, and the one expected in the test analysis is GATK version 2.4. It seems that Genome::Sys expects this tool to be installed as a package in a standard way so that the version can be resolved as well.
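The depends-versus-missing comparison at the start of this comment can be sketched as a small check over the two release directories. The *.depends.missing suffix comes from the grep output earlier in the thread; the plain *.depends suffix for the regular list is an assumption, and dep_status is a hypothetical helper name.

```shell
# Sketch: report how a package is listed in one genome-snapshot-deps release dir.
# Assumption: regular lists end in .depends; unavailable ones in .depends.missing.
# dep_status is a hypothetical helper, not part of the GMS tooling.
# Usage: dep_status RELEASE_DIR PACKAGE   -> prints missing, depends, or absent
dep_status() (
    dir=$1
    pkg=$2
    if grep -qsF -- "$pkg" "$dir"/*.depends.missing; then
        echo missing
    elif grep -qsF -- "$pkg" "$dir"/*.depends; then
        echo depends
    else
        echo absent
    fi
)
```

From the top of the genome-snapshot-deps checkout, `dep_status precise libgatk-protected-java` and `dep_status lucid libgatk-protected-java` would then show the two states side by side.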

malachig commented 10 years ago

One thing about that package is that it should not actually be distributed outside of TGI. It contains a patch that disables the "phone home" behavior. Even though the modification is in the "public" source tree, the package includes code from the "protected" source tree, which is under a license that does not allow redistribution.

Our GATK wrappers depend on the phone home behavior being disabled because they always insert the "-et NO_ET" argument into the command line. If you try this against a jar file without our patch it will give an error because they require an additional argument with a key provided by the Broad to allow the phone home override. Our patch skips this check.

malachig commented 10 years ago

Since we cannot redistribute GATK, we will need to set up the standalone GMS in a way that allows the user to manually install GATK after obtaining the appropriate permissions from the Broad.
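One possible shape for that manual step is sketched below. It assumes the user already has a jar they are permitted to use and reproduces the TGI layout mentioned earlier (a versioned jar plus a GenomeAnalysisTK.jar symlink). install_gatk_jar is a hypothetical helper, and /usr/share/java is only the default because that is where GENOME_JAR_PATH points on the standalone box.

```shell
# Sketch: place a user-supplied GATK jar where the jar lookup expects it.
# Assumption: the layout mirrors TGI, i.e. GenomeAnalysisTK-<version>.jar
# plus a GenomeAnalysisTK.jar symlink in the same directory.
# install_gatk_jar is a hypothetical helper, not part of the GMS installer.
# Usage: install_gatk_jar SRC_JAR VERSION [DEST_DIR]
install_gatk_jar() (
    src=$1
    version=$2
    dest=${3:-/usr/share/java}
    [ -f "$src" ] || { echo "No such jar: $src" >&2; exit 1; }
    mkdir -p "$dest" || exit 1
    cp "$src" "$dest/GenomeAnalysisTK-$version.jar" || exit 1
    # Relative link target, so the directory can be moved as a unit.
    ln -sf "GenomeAnalysisTK-$version.jar" "$dest/GenomeAnalysisTK.jar"
)
```

This only addresses the jar lookup. As noted above, the wrappers still pass "-et NO_ET", so an unpatched jar would be expected to error unless that is handled separately.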

gatoravi commented 10 years ago

This is how you get earlier versions of the GATK. Note: binaries for older versions are not distributed by the Broad; they have to be compiled from source.

"Get the package for the version you want from this page: https://github.com/broadgsa/gatk-protected/tags

From your terminal/console, navigate to the directory containing the source code. There, you run the command:

ant clean dist

This will do everything for you. The compiled binary will be in the newly-created dist directory."

gatoravi commented 9 years ago

I'm seeing this same issue on a somatic-variation build. Looks like the gms-pub specific change was reverted in the recent merge/refactor in master. Compare https://github.com/genome/genome/commit/1b8a40cd0936dfa859d0acb510fc57e0c9bdc05f and https://github.com/genome/genome/blob/master/lib/perl/Genome/Model/Tools/Gatk/Base.pm#L64

Figure out if master can be modified to use consistent paths.

gatoravi commented 9 years ago

Not sure where to bring this up on master; issues seem to be disabled for that repo. Cc'ing @nnutter for ideas.

gatoravi commented 9 years ago

This was added to the SGMS branch here: https://github.com/genome/genome/commit/b62e2b09651ce966bfc0b21b0ddde49c89b58229 We are divergent from master on this, and it might come up again.

gatoravi commented 9 years ago

On closer look, master has an improved way of looking for JAR paths. We are still using the old method, which we think is OK for now.