mwalker174 closed 1 year ago
Thank you, @mwalker174! This looks very good, and I'm excited to see we're getting closer to a better Scramble integration that can replace the restrictive alternative.
Out of curiosity, is there any reason the Scramble docker was not added to `build_docker.py`?
This PR incorporates several updates to the Scramble tool and workflow that substantially improve performance and reduce costs on GCP.
`cluster_identifier` tool that scans the BAM/CRAM for clusters of split reads, and (2) `cluster_analysis` that calls insertion MEIs from the clusters. Note that (2) requires substantially more compute and memory and is now parallelized (see below).
`cluster_identifier`, and (2) parallelization of the `cluster_analysis` R script. Importantly, (1) prevents htslib from automatically downloading the reference from the EBI web server, which causes major issues at scale. This Dockerfile also loads updated versions of htslib libraries and bcftools relative to the previous Scramble docker released by the developers.
Final cost figures are forthcoming, but the current estimate is <$0.10 per sample. We should experience fewer failures and long-running outliers in the future.
Identical output was confirmed on the 1KGP reference panel (on primary contigs).
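As a rough illustration of why a parallelized analysis can reproduce serial output: if each cluster is processed independently and results are gathered in input order, the merged result matches a single serial pass. The Python sketch below shows that scatter/gather pattern only; the actual parallelization lives in the `cluster_analysis` R script and may work differently, and `analyze_chunk` is a hypothetical stand-in for the real per-cluster analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def analyze_chunk(clusters):
    """Hypothetical per-cluster analysis: the length of each cluster sequence."""
    return [len(c) for c in clusters]


def analyze_parallel(clusters, n_workers=4):
    """Scatter clusters across workers and gather results in input order."""
    chunks = split_into_chunks(clusters, n_workers)
    # Threads keep the sketch simple; a CPU-bound workload would use processes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in submission order, so chunk order is preserved.
        results = pool.map(analyze_chunk, chunks)
        return [row for chunk_result in results for row in chunk_result]
```

Comparing `analyze_parallel(clusters)` against `analyze_chunk(clusters)` on the same input is a small-scale analogue of the identical-output check described above.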