Closed bschilder closed 3 years ago
Sorting is now faster, taking advantage of data.table functions:
#' Sort sum stats by genomic coordinates
#'
#' @param sumstats_dt data table obj of the summary statistics file for the GWAS
#' @param sort_coordinates Whether to sort by coordinates.
#' @return The summary statistics data table, sorted by CHR then BP if requested.
#' @keywords internal
#' @importFrom data.table setorderv
sort_coords <- function(sumstats_dt,
                        sort_coordinates=TRUE){
    if(sort_coordinates){
        message("Sorting coordinates")
        ### setorderv is much more efficient than dplyr::arrange
        ### and sorts sumstats_dt in place, so there is no separate sorted copy
        data.table::setorderv(sumstats_dt, c("CHR", "BP"))
    }
    return(sumstats_dt)
}
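As a side note on the snippet above: `data.table::setorderv` reorders the table by reference rather than returning a copy that has to be captured. A minimal sketch with toy data (the SNP IDs and positions are invented for illustration, not taken from a real GWAS):

```r
library(data.table)

# Toy summary statistics; all values are made up for illustration.
toy_dt <- data.table(
    SNP = c("rs3", "rs1", "rs2"),
    CHR = c(2L, 1L, 1L),
    BP  = c(500L, 900L, 100L)
)

# Sorts toy_dt in place by CHR, then BP; no reassignment is needed.
setorderv(toy_dt, c("CHR", "BP"))

sorted_snps <- toy_dt$SNP
print(toy_dt)
```

Because the sort happens in place, the same object that was passed in is the sorted one.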
Just realized that CHR isn't ordering correctly because of the sex chromosomes, which make the whole column a character vector (so it sorts alphabetically, e.g. "10" before "2").
Fix: turn CHR into an ordered factor, sort, then convert back to character to ensure there are no issues with merging later on:
sort_coords <- function(sumstats_dt,
                        sort_coordinates=TRUE){
    if(sort_coordinates){
        message("Sorting coordinates")
        ### Assumes CHR values are "1"-"22", "X", "Y" (upper-case, no "chr" prefix);
        ### any value not in chr_order becomes NA when converted to a factor
        chr_order <- c(1:22,"X","Y")
        ### Turn CHR into an ordered factor to account for X and Y chroms
        sumstats_dt[,CHR:=factor(CHR, levels = chr_order, ordered = TRUE)]
        ### setorderv is much more efficient than dplyr::arrange
        data.table::setorderv(sumstats_dt, c("CHR", "BP"))
        ### Now set CHR back to character to avoid issues when merging with other dts
        sumstats_dt[,CHR:=as.character(CHR)]
    }
    ### setorderv and := modify sumstats_dt by reference, so return it directly
    return(sumstats_dt)
}
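To see why the character column mis-sorts and what the factor round trip changes, here is a small base R sketch. The chromosome values are made up, and it assumes upper-case "X"/"Y" labels with no "chr" prefix:

```r
# Toy chromosome labels stored as character, as happens once X/Y are present.
chr <- c("10", "2", "X", "1", "Y", "22")

# Plain character sorting is alphabetical, so "10" lands before "2".
# method = "radix" keeps the result independent of the session locale.
alpha_sorted <- sort(chr, method = "radix")

# Assumed label set: autosomes 1-22 followed by X and Y.
chr_order <- c(1:22, "X", "Y")

# An ordered factor sorts by level position; convert back to character afterwards.
chr_factor <- factor(chr, levels = chr_order, ordered = TRUE)
genomic_sorted <- as.character(sort(chr_factor))

print(alpha_sorted)
print(genomic_sorted)
```

Any label missing from `chr_order` (for example "MT" or "chr1") would become `NA` in the factor, so the level set needs to match the labels actually present in the data.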
Currently, format_sumstats returns the sum stats sorted by RSIDs (alphabetically). Usually GWAS sum stats files are (ideally) sorted by genomic coordinates (CHR then BP). This is especially important for tabix, which requires all rows to be sorted by coordinates. #7