Closed bschilder closed 3 years ago
Was working on this for phenomix
, but not sure if this function fits better inMungeSumstats
or phenomix
.
header_only
would be passed to MungeSumstats::get_genome_build
.
#' Infer genome builds
#'
#' Infers the genome build of summary statistics files (GRCh37 or GRCh38)
#' from the data. Uses SNP (RSID) & CHR & BP to get genome build.
#'
#' @param nCores Number of threads to use for parallel processes.
#' @param verbose Print messages.
#' @inheritParams MungeSumstats::get_genome_build
#'
#' @return ref_genome the genome build of the data
#'
#' @examples
#' #Pass path to Educational Attainment Okbay sumstat file to a temp directory
#'
#' eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
#' package="MungeSumstats")
#'
#' ## Call uses reference genome as default with more than 2GB of memory,
#' ## which is more than what 32-bit Windows can handle so remove certain checks
#' is_32bit_windows <-
#' .Platform$OS.type == "windows" && .Platform$r_arch == "i386"
#' if (!is_32bit_windows) {
#' ref_genome <- infer_genome_builds(sumstats=eduAttainOkbayPth)
#'}
#' @export
#' @importFrom MungeSumstats get_genome_build
#' @importFrom parallel mclapply
#' @importFrom utils capture.output
#' @importFrom dplyr %>%
infer_genome_builds <- function(sumstats,
header_only=TRUE,
sampled_snps=10000,
nCores=1,
verbose=TRUE){
start <- Sys.time()
sumstats <- unique(sumstats)
messager("Inferring genome build of",length(sumstats),
"sumstats file(s).",
v=verbose)
#### Get basename ####
munged_ids <- gsub(paste(MungeSumstats:::supported_suffixes(),collapse = "|"),
"",basename(sumstats))
#### Infer builds ####
builds <- parallel::mclapply(sumstats,
function(x,
.sampled_snps=sampled_snps){
message_parallel(x)
MungeSumstats::get_genome_build(sumstats = x,
sampled_snps = .sampled_snps,
nThread = 1)
}, mc.cores = nCores) %>% `names<-`(munged_ids)
#### Report time ####
messager(utils::capture.output(difftime(Sys.time(), start)))
#### Report build counts ####
build_counts <- table(unlist(builds))
for(i in seq(1,length(build_counts))){
messager(names(build_counts)[i],":",build_counts[i],"file(s)",v=verbose)
}
return(builds)
}
header_only
now implemented in get_genome_build
. Confirmed using the example data (eduAttainOkbayPth
) that both methods predict "GRCH37".
get_genome_builds
now implemented as well (renamed from infer_genome_builds
to make more consistent with subfunction)
Probably don't need to export both get_genome_build
and get_genome_builds
, esp since this might get confusing for users.
Will export the more comprehensiveget_genome_builds
only.
get_genome_build
is super useful but can be quite slow, since we currently load the entire sumstats file in, and then downsample the SNPs from that.This doesn't make a difference when we're running the full
format_sumstats
pipeline bc the sumstats are already loaded into memory anyway. But if you're feeding in a path, it way to speed things up is to read in only the top N rows and just use those as the sample (instead of sampling them genome-wide). This may or may not be as accurate as the genome-wide sampling, but I can test that.