broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Implement clustering subworkflow for CNV cohort/PoN batching. #5632

Open samuelklee opened 5 years ago

samuelklee commented 5 years ago

@asmirnov239 and Jack Fu from the Talkowski lab are currently implementing this. The subworkflow will first be used to precluster gCNV cohorts/cases, but it could also be used for the ModelSegments workflow.

mwalker174 commented 3 years ago

Oops

ldgauthier commented 3 years ago

The Talkowski lab version of this is in R and requires some packages that don't seem to be available anymore, as well as the Python tool svtk, also developed in their lab. It also localizes all the files with a separate Java program they developed. Their implementation is at https://github.com/theisaacwong/talkowski/tree/master/gCNV (most critically gCNV_Pipeline.Rmd and gCNV_helper.jar) and appears to be under active development.

My simplified implementation is at https://app.terra.bio/#workspaces/broad-firecloud-dsde-methods/gCNV-CMG-test/notebooks/launch/perform_clustering.ipynb, but it's still under development, with some help from Brian in TAG.

ldgauthier commented 3 years ago

PCA results from the R pipeline for the full CMG cohort are in the bucket associated with my workspace, at gs://fc-d3bc13de-ef61-4854-a05c-d311219008b3/pca.rda (the rownames in pca$x should be the sample names).

Also in the bucket are two bed files. The aux_capture_uniques file was used for clustering in the R pipeline, and the gencode file was used for calling.

Other notes: gCNV_helper.jar is Java 14 and may have some issues reading .vcf.gz files on Mac. At least a subset of the CMG calls from the R pipeline are in gs://fc-d3bc13de-ef61-4854-a05c-d311219008b3/isaacsCalls.tsv
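
For anyone picking this up, here is a rough sketch of what batching samples on their principal-component coordinates could look like. It is not the Talkowski lab method or the contents of perform_clustering.ipynb. Plain k-means is used purely as a stand-in, and the function name `kmeans_batches`, the choice of `k`, and the toy coordinates are all hypothetical. The input is assumed to be a mapping from sample name to PC coordinates, analogous to the rows of pca$x.

```python
import math
from typing import Dict, List, Sequence


def kmeans_batches(
    coords: Dict[str, Sequence[float]], k: int, n_iter: int = 50
) -> Dict[str, int]:
    """Assign each sample to one of k clusters with plain k-means.

    `coords` maps sample name -> PC coordinates (hypothetical input shape,
    analogous to the rows of pca$x). k-means is an illustrative stand-in,
    not the clustering method used by any existing pipeline.
    """
    names: List[str] = sorted(coords)
    if not 1 <= k <= len(names):
        raise ValueError("k must be between 1 and the number of samples")

    # Deterministic farthest-point initialisation, so runs are reproducible.
    centers: List[List[float]] = [list(coords[names[0]])]
    while len(centers) < k:
        farthest = max(
            names, key=lambda n: min(math.dist(coords[n], c) for c in centers)
        )
        centers.append(list(coords[farthest]))

    labels: Dict[str, int] = {}
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = {
            n: min(range(k), key=lambda j: math.dist(coords[n], centers[j]))
            for n in names
        }
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [coords[n] for n in names if labels[n] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy coordinates (made up for illustration, not real cohort data).
toy = {"s1": (0.0, 0.1), "s2": (0.2, 0.0), "s3": (5.0, 5.1), "s4": (5.2, 4.9)}
labels = kmeans_batches(toy, k=2)
```

Samples that share a label would then be run together as one gCNV cohort batch. In practice the number of clusters and the number of PCs to keep would need to be tuned against the real cohort.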