Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Sort sum stats by genomic coordinates #8

Closed bschilder closed 3 years ago

bschilder commented 3 years ago

Currently, format_sumstats returns the sum stats sorted alphabetically by RSID. GWAS sum stats files are usually (and ideally) sorted by genomic coordinates (CHR, then BP).

This is especially important for tabix, which requires all rows to be sorted by coordinates. #7

bschilder commented 3 years ago

Sorting is now faster thanks to data.table functions:

#' Sort sum stats by genomic coordinates
#'
#' @param sumstats_dt data table obj of the summary statistics file for the GWAS
#' @param sort_coordinates Whether to sort by coordinates.
#' @return The summary statistics data table, sorted by CHR then BP if requested.
#' @keywords internal
#' @importFrom data.table setorderv
sort_coords <- function(sumstats_dt,
                        sort_coordinates=TRUE){
    if(sort_coordinates){
        message("Sorting coordinates")
        ### setorderv is much more efficient than dplyr::arrange
        ### and sorts sumstats_dt in place (by reference)
        data.table::setorderv(sumstats_dt, c("CHR", "BP"))
    }
    return(sumstats_dt)
}

bschilder commented 3 years ago

Issue

Just realized that CHR isn't ordering correctly because of the presence of sex chromosomes, which makes the whole column a character vector.
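
To illustrate the problem, here is a minimal base R sketch with made-up chromosome labels (the vector `chr` is hypothetical, not taken from any real sum stats file). A character sort compares strings rather than numbers, so "10" lands before "2", whereas an ordered factor follows the order of its levels:

```r
# Hypothetical chromosome labels, stored as character because of X and Y
chr <- c("1", "2", "10", "X", "Y", "21")

# Character sort in the C locale (method = "radix"): string comparison
char_sorted <- sort(chr, method = "radix")
print(char_sorted)

# Ordered factor: sorting follows the level order 1-22, X, Y
chr_factor <- factor(chr, levels = c(1:22, "X", "Y"), ordered = TRUE)
factor_sorted <- as.character(sort(chr_factor))
print(factor_sorted)
```

The first result puts "10" ahead of "2"; the second gives the expected genomic order.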

Solution

Turn CHR into an ordered factor, sort, then convert back to character to ensure there are no issues with merging later on:

sort_coords <- function(sumstats_dt,
                        sort_coordinates=TRUE){
    if(sort_coordinates){
        message("Sorting coordinates")
        ### Assumes CHR is coded as 1-22, X, Y (uppercase); other values become NA
        chr_order <- c(1:22,"X","Y")
        ### Turn CHR into an ordered factor to account for X and Y chroms
        sumstats_dt[,CHR:=factor(CHR, levels = chr_order, ordered = TRUE)]
        ### setorderv is much more efficient than dplyr::arrange
        data.table::setorderv(sumstats_dt, c("CHR", "BP"))
        ### Now set CHR back to character to avoid issues when merging with other dts
        sumstats_dt[,CHR:=as.character(CHR)]
    }
    ### sumstats_dt was sorted in place (by reference), so return it directly
    return(sumstats_dt)
}
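
As a quick check of the approach above, here is a sketch that applies the same three steps to a toy table. The table `toy_dt` and its values are invented for illustration, and it assumes chromosomes are coded as 1-22, X, Y in uppercase:

```r
library(data.table)

# Hypothetical toy summary statistics; values are made up for illustration
toy_dt <- data.table(
    SNP = c("rs5", "rs1", "rs4", "rs2", "rs3"),
    CHR = c("X", "10", "2", "2", "1"),
    BP  = c(500L, 100L, 900L, 300L, 700L)
)

chr_order <- c(1:22, "X", "Y")
# 1. Ordered factor so that 2 sorts before 10 and X/Y come last
toy_dt[, CHR := factor(CHR, levels = chr_order, ordered = TRUE)]
# 2. Sort in place by chromosome, then by base-pair position
setorderv(toy_dt, c("CHR", "BP"))
# 3. Back to character to avoid type mismatches in later merges
toy_dt[, CHR := as.character(CHR)]

print(toy_dt)
```

Note that factor() turns any value not listed in the levels into NA, so labels outside chr_order (for example a mitochondrial chromosome or a "chr" prefix) would need handling before this step.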