TheJacksonLaboratory / cs-nf-pipelines

The Jackson Laboratory Computational Sciences Nextflow based analysis pipelines
MIT License

Gold Standard SNP and Indel files #2

Closed quarkstar-codes closed 2 years ago

quarkstar-codes commented 2 years ago

Hello! I understand that this resource is meant for use on the compute clusters at Jackson Lab, but I'm trying to set up my own Docker-containerised pipelines based on yours. I have been looking at the gold standard SNP and indel files listed in the config files for WES. Can you help me understand what it means when a file is annotated as "GATK formatted"? Is there a particular VCF formatting that GATK prefers?

MikeWLloyd commented 2 years ago

Hi @quarkstar-codes, thanks for your interest in the pipelines! To get running in an external location, the main thing you would need (as you note) is the reference data. Second to that would be a profile configuration for your local environment (HPC type, cloud, etc.). The tools for the pipelines are already fully Dockerized and hosted in publicly accessible locations, so you shouldn't need to change anything there.

Regarding the files, much of the human data comes from the GATK hg38 resource bundle: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle

The GATK bundle includes the reference genome as well as the gold standard SNP and indel files.

The main difference between the GATK resource bundle files and what you might find from Ensembl (or elsewhere) is the naming of the chromosomes, as well as which unplaced / unlocalized contigs are present (and how they are named).
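
To make the naming point concrete, here is a minimal Python sketch (not part of the pipelines) that compares the contig names declared in a VCF header against the sequence names in a reference FASTA index. The function names and the toy input lines are hypothetical and only for illustration. It assumes the VCF header carries `##contig` lines and that the index is a samtools-style `.fai` file, with the sequence name in the first tab-separated column. Files from Ensembl typically use names like `1` and `MT`, while the GATK hg38 bundle uses `chr1` and `chrM`.

```python
import re

# Matches the ID field of a VCF "##contig=<ID=...,length=...>" header line.
CONTIG_RE = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def vcf_contigs(header_lines):
    """Return contig IDs declared in the ##contig lines of a VCF header, in order."""
    names = []
    for line in header_lines:
        if line.startswith("#CHROM"):
            break  # the column header line marks the end of the VCF header
        match = CONTIG_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def fai_contigs(fai_lines):
    """Return sequence names from .fai lines (first tab-separated column)."""
    return [line.split("\t", 1)[0] for line in fai_lines if line.strip()]


def compare_contigs(vcf_names, ref_names):
    """Report contig names that appear in one file but not in the other."""
    vcf, ref = set(vcf_names), set(ref_names)
    return {
        "missing_from_reference": sorted(vcf - ref),
        "missing_from_vcf": sorted(ref - vcf),
    }


# Toy inputs (illustrative only): an Ensembl-style VCF header and a
# GATK/UCSC-style reference index. Offsets in the .fai lines are made up.
example_vcf_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=1,length=248956422>",
    "##contig=<ID=MT,length=16569>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
example_fai = [
    "chr1\t248956422\t6\t60\t61",
    "chrM\t16569\t253105708\t60\t61",
]

report = compare_contigs(vcf_contigs(example_vcf_header), fai_contigs(example_fai))
print(report)
```

With real files you would pass the lines of the VCF header and of the `.fai` file instead of the toy lists. If either list in the report is non-empty, the VCF and the reference do not share a naming scheme or contig set, and GATK tools will generally complain about the mismatch.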

Hopefully the GATK resource link helps, but feel free to reopen this issue (or put a new one in) if you require anything else.