Speed up `get_genome_build`

bschilder commented 3 years ago

get_genome_build is super useful but can be quite slow, since we currently load the entire sumstats file in, and then downsample the SNPs from that.

This doesn't make a difference when we're running the full format_sumstats pipeline bc the sumstats are already loaded into memory anyway. But if you're feeding in a path, it way to speed things up is to read in only the top N rows and just use those as the sample (instead of sampling them genome-wide). This may or may not be as accurate as the genome-wide sampling, but I can test that.

bschilder commented 3 years ago

Was working on this for phenomix, but not sure if this function fits better inMungeSumstats or phenomix.

header_only would be passed to MungeSumstats::get_genome_build.

#' Infer genome builds
#' 
#' Infers the genome build of  summary statistics files (GRCh37 or GRCh38) 
#' from the data. Uses SNP (RSID) & CHR & BP to get genome build.
#'
#' @param nCores Number of threads to use for parallel processes.  
#' @param verbose Print messages.
#' @inheritParams MungeSumstats::get_genome_build
#' 
#' @return ref_genome the genome build of the data
#'
#' @examples
#' #Pass path to Educational Attainment Okbay sumstat file to a temp directory
#'
#' eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt", 
#'                                  package="MungeSumstats")
#' 
#' ## Call uses reference genome as default with more than 2GB of memory,
#' ## which is more than what 32-bit Windows can handle so remove certain checks
#' is_32bit_windows <- 
#' .Platform$OS.type == "windows" && .Platform$r_arch == "i386"
#' if (!is_32bit_windows) {
#' ref_genome <- infer_genome_builds(sumstats=eduAttainOkbayPth)                                  
#'}
#' @export
#' @importFrom MungeSumstats get_genome_build
#' @importFrom parallel mclapply
#' @importFrom utils capture.output
#' @importFrom dplyr %>%
infer_genome_builds <- function(sumstats,
                                header_only=TRUE,
                                sampled_snps=10000,
                                nCores=1,
                                verbose=TRUE){ 
    start <- Sys.time()
    sumstats <- unique(sumstats)
    messager("Inferring genome build of",length(sumstats),
             "sumstats file(s).",
             v=verbose)
    #### Get basename ####
    munged_ids <- gsub(paste(MungeSumstats:::supported_suffixes(),collapse = "|"),
                       "",basename(sumstats))
    #### Infer builds ####
    builds <- parallel::mclapply(sumstats, 
                                 function(x,
                                          .sampled_snps=sampled_snps){
        message_parallel(x)
        MungeSumstats::get_genome_build(sumstats = x, 
                                        sampled_snps = .sampled_snps,
                                        nThread = 1)
    }, mc.cores = nCores) %>% `names<-`(munged_ids) 
    #### Report time ####
    messager(utils::capture.output(difftime(Sys.time(), start)))
    #### Report build counts ####
    build_counts <- table(unlist(builds))
    for(i in seq(1,length(build_counts))){
        messager(names(build_counts)[i],":",build_counts[i],"file(s)",v=verbose)
    } 
    return(builds)
}

bschilder commented 3 years ago

header_only now implemented in get_genome_build. Confirmed using the example data (eduAttainOkbayPth) that both methods predict "GRCH37".

get_genome_builds now implemented as well (renamed from infer_genome_builds to make more consistent with subfunction)

bschilder commented 3 years ago

Probably don't need to export both get_genome_build and get_genome_builds, esp since this might get confusing for users.

Will export the more comprehensiveget_genome_builds only.

Al-Murphy / MungeSumstats

Speed up `get_genome_build` #54