broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

"Extract our data sources" for Funcotator in MuTect2 wdl #6731

Open slzhao opened 4 years ago

slzhao commented 4 years ago

Feature request

Tool(s) or class(es) involved

MuTect2 wdl (mutect2.wdl), task Funcotate

Description

Would it be a good idea to add an option to skip the "Extract our data sources" step in mutect2.wdl? I am running mutect2.wdl on an HPC system where all the data sources for Funcotate, including gnomAD, are already unzipped and ready to use, so there is no need to extract them on every run. Skipping the step would save time and resources. I can make the change and open a pull request if it sounds like a good idea.

The code in mutect2.wdl that I would make optional is listed below:

     # Extract our data sources:
     echo "Extracting data sources zip file..."
     mkdir datasources_dir
     tar zxvf ~{data_sources_tar_gz} -C datasources_dir --strip-components 1
     DATA_SOURCES_FOLDER="$PWD/datasources_dir"

     # Handle gnomAD:
     if ~{use_gnomad} ; then
         echo "Enabling gnomAD..."
         for potential_gnomad_gz in gnomAD_exome.tar.gz gnomAD_genome.tar.gz ; do
             if [[ -f ~{dollar}{DATA_SOURCES_FOLDER}/~{dollar}{potential_gnomad_gz} ]] ; then
                 cd ~{dollar}{DATA_SOURCES_FOLDER}
                 tar -zvxf ~{dollar}{potential_gnomad_gz}
                 cd -
             else
                 echo "ERROR: Cannot find gnomAD folder: ~{dollar}{potential_gnomad_gz}" 1>&2
                 false
             fi
         done
     fi
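
A minimal sketch of what the skip could look like, written as plain shell rather than as the final WDL change. The function names `resolve_data_sources` and `extract_gnomad_if_needed` are hypothetical, and the sketch assumes that each gnomAD archive unpacks to a folder named after the archive (for example `gnomAD_exome.tar.gz` to `gnomAD_exome`). How the pre-extracted path would be passed into the task (for instance as a separate optional input) is left open here.

```shell
# Sketch only: these helper names are hypothetical, not part of mutect2.wdl.

# Sets DATA_SOURCES_FOLDER from either a pre-extracted directory or a tar.gz.
resolve_data_sources() {
    ds_src="$1"
    if [ -d "${ds_src}" ]; then
        # Already unpacked (e.g. on a shared HPC filesystem): use it in place.
        echo "Using pre-extracted data sources: ${ds_src}" 1>&2
        DATA_SOURCES_FOLDER="${ds_src}"
    else
        # Same extraction as the existing task.
        echo "Extracting data sources zip file..." 1>&2
        mkdir -p datasources_dir
        tar zxf "${ds_src}" -C datasources_dir --strip-components 1
        DATA_SOURCES_FOLDER="$PWD/datasources_dir"
    fi
}

# Unpacks a gnomAD archive only when its target folder does not exist yet.
# Assumption: gnomAD_exome.tar.gz unpacks to a folder named gnomAD_exome.
extract_gnomad_if_needed() {
    gnomad_gz="$1"
    gnomad_dir="${DATA_SOURCES_FOLDER}/${gnomad_gz%.tar.gz}"
    if [ -d "${gnomad_dir}" ]; then
        echo "gnomAD already extracted: ${gnomad_dir}" 1>&2
    elif [ -f "${DATA_SOURCES_FOLDER}/${gnomad_gz}" ]; then
        tar -zxf "${DATA_SOURCES_FOLDER}/${gnomad_gz}" -C "${DATA_SOURCES_FOLDER}"
    else
        echo "ERROR: Cannot find gnomAD folder: ${gnomad_gz}" 1>&2
        return 1
    fi
}
```

With this shape, passing a directory skips every extraction step, while passing the tar.gz keeps the current behavior, so existing cloud runs would not need to change.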

jonn-smith commented 4 years ago

@slzhao This is definitely a reasonable thing to do. Please feel free to issue a PR to do this.

That said, I have to give you a warning. Some of the datasources rely on sqlite3 and therefore have issues on some distributed filesystems (see this post on Lustre/NFS errors). There are a few posts in the GATK forums about this as well.

So you may want to do some testing before using one centralized copy of the data sources.

As a heads-up - I do not have access to a Lustre filesystem so I am unable to do any debugging with it.
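
If the centralized copy does turn out to hit the sqlite3 problem, one possible fallback is to copy the already-extracted folder to node-local disk at the start of each job. The sketch below is untested on Lustre or NFS. `stage_data_sources` is a hypothetical helper, and treating `TMPDIR` as node-local scratch is an assumption about the cluster configuration.

```shell
# Sketch only: `stage_data_sources` is a hypothetical helper, and using
# TMPDIR as node-local scratch is an assumption about the cluster.
stage_data_sources() {
    shared_dir="$1"
    scratch_root="${TMPDIR:-/tmp}"
    # Create a unique per-job directory on local scratch.
    local_dir=$(mktemp -d "${scratch_root}/funcotator_ds.XXXXXX") || return 1
    # Copy the directory contents (including dotfiles) to local scratch.
    cp -R "${shared_dir}/." "${local_dir}/" || return 1
    DATA_SOURCES_FOLDER="${local_dir}"
}
```

This still avoids decompressing the archives on every run, although the copy itself costs time and local disk space, so it is worth measuring against the plain extraction.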