Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Read VCFs without #CHR #63

Closed bschilder closed 3 years ago

bschilder commented 3 years ago

When MungeSumstats writes a VCF, the output doesn't contain a #CHR header line, which read_sumstats/read_vcf currently relies on to find where the data starts. I'm modifying read_vcf so that it can still read VCFs in this situation.

https://github.com/neurogenomics/MungeSumstats/blob/495317fb823077178837a30319f9dfed7884ac1e/R/read_vcf.R#L60

Here's the reprex:

formatted_example <- function(path=system.file("extdata", "eduAttainOkbay.txt",
                                               package = "MungeSumstats")){
    sumstats_dt <- suppressMessages(
        read_sumstats(path = path)
    )
    sumstats_dt <-
       standardise_sumstats_column_headers_crossplatform(
            sumstats_dt = sumstats_dt)$sumstats_dt
    sumstats_dt <- sort_coords(sumstats_dt = sumstats_dt)
    return(sumstats_dt)
}
sumstats_dt <- MungeSumstats:::formatted_example()

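# The four settings below are assumed values for illustration only;
# the original test varies them and they were not defined in this snippet.
fileext <- ".vcf"
write_vcf <- TRUE
tabix_index <- FALSE
standardise_headers <- FALSE

# Validate/adjust the save path the same way write_sumstats does internally
path_in <- tempfile(fileext = fileext)
check <- MungeSumstats:::check_save_path(save_path = path_in,
                                         log_folder = tempdir(),
                                         log_folder_ind = FALSE,
                                         tabix_index = tabix_index,
                                         write_vcf = write_vcf)
path_in <- check$save_path

# Write the summary statistics, then try to read them back in
path_out <- MungeSumstats::write_sumstats(
    sumstats_dt = sumstats_dt,
    save_path = path_in,
    write_vcf = write_vcf,
    tabix_index = tabix_index,
    return_path = TRUE
)
testthat::expect_true(file.exists(path_out))
dat <- MungeSumstats::read_sumstats(
    path_out,
    standardise_headers = standardise_headers)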
 Error in data.table::fread(path, nThread = nThread, sep = "\t", skip = "#CHR",  : 
  skip='#CHR' not found in input (it is case sensitive and literal; i.e., no patterns, wildcards or regex) 
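
For context, here's a minimal sketch of why fread fails here. It only uses toy files whose contents are made up for illustration (they are not real MungeSumstats output): when skip is given a string, fread searches for it literally and errors if it never appears.

```r
library(data.table)

# Toy file WITHOUT a "#CHR..." header line (contents are made up)
no_chr <- tempfile(fileext = ".tsv")
writeLines(c("##fileformat=VCFv4.2",
             "CHROM\tPOS\tID",
             "1\t100\trs1",
             "1\t200\trs2"), no_chr)

# skip = "<string>" is a literal search, so fread errors when it is absent;
# capture the error message instead of stopping
res <- tryCatch(
    fread(no_chr, sep = "\t", skip = "#CHR"),
    error = function(e) conditionMessage(e)
)

# Same toy data WITH a "#CHROM" header line: the search succeeds
with_chr <- tempfile(fileext = ".tsv")
writeLines(c("##fileformat=VCFv4.2",
             "#CHROM\tPOS\tID",
             "1\t100\trs1",
             "1\t200\trs2"), with_chr)
dt <- fread(with_chr, sep = "\t", skip = "#CHR", header = TRUE)
```

In the first case res holds the error message rather than a table; in the second, dt holds the two data rows.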

bschilder commented 3 years ago

I added several new subfunctions to handle this. They actually do a pretty good job of parsing the VCF, so we might eventually want to transition to this dedicated VCF parsing tool (VariantAnnotation). For now I've made this method a backup, because I wasn't sure whether there would be downstream consequences I hadn't thought of.

https://github.com/neurogenomics/MungeSumstats/blob/bschilder_dev/R/read_vcf_data.R https://github.com/neurogenomics/MungeSumstats/blob/bschilder_dev/R/vcf2df.R
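
Roughly, the "backup" idea looks like the sketch below. This is not the code in the files linked above: read_vcf_with_fallback is a hypothetical helper name, and the choice of output columns (CHR, BP, SNP) is an assumption made to keep the example short. It tries the fast fread path first and only falls back to VariantAnnotation if that fails.

```r
library(data.table)

# Hypothetical helper (NOT the package's actual implementation):
# try the fast fread path first, fall back to VariantAnnotation on error.
read_vcf_with_fallback <- function(path, nThread = 1) {
    tryCatch(
        fread(path, nThread = nThread, sep = "\t", skip = "#CHR"),
        error = function(e) {
            # Only reached when fread fails, e.g. no "#CHR" line was found
            if (!requireNamespace("VariantAnnotation", quietly = TRUE)) {
                stop("fread failed and VariantAnnotation is not installed: ",
                     conditionMessage(e))
            }
            vcf <- VariantAnnotation::readVcf(path)
            rr <- SummarizedExperiment::rowRanges(vcf)
            # Assumed minimal set of columns, for illustration only
            data.table(
                CHR = as.character(GenomicRanges::seqnames(rr)),
                BP = GenomicRanges::start(rr),
                SNP = names(rr)
            )
        }
    )
}
```

Keeping fread as the primary path means existing behaviour is unchanged for files that already read fine, which is the main reason for treating the dedicated parser as a fallback rather than a replacement.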

I also improved get_vcf_sample_ids so that it searches for sample names regardless of whether the path is an MRC IEU URL.

https://github.com/neurogenomics/MungeSumstats/blob/bschilder_dev/R/get_vcf_sample_ids.R
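
As a rough illustration of searching the header for sample names (not the actual get_vcf_sample_ids code): in the VCF format, any columns after the nine fixed ones on the #CHROM line are sample IDs. The helper name sample_ids_from_header and the max_lines cutoff are assumptions made for this sketch, and it only handles uncompressed local files.

```r
# Hypothetical helper (NOT the package's implementation): pull sample IDs
# from the "#CHROM" column-header line of an uncompressed VCF.
sample_ids_from_header <- function(path, max_lines = 1000L) {
    # Only scan the top of the file; max_lines is an arbitrary assumed cutoff
    lines <- readLines(path, n = max_lines)
    hdr <- grep("^#CHROM", lines, value = TRUE)
    if (length(hdr) == 0L) {
        return(character(0))
    }
    fields <- strsplit(hdr[1], "\t", fixed = TRUE)[[1]]
    # The nine fixed VCF columns; anything else on this line is a sample ID
    fixed_cols <- c("#CHROM", "POS", "ID", "REF", "ALT",
                    "QUAL", "FILTER", "INFO", "FORMAT")
    setdiff(fields, fixed_cols)
}

# Usage on a toy file with two made-up sample names
toy <- tempfile(fileext = ".vcf")
writeLines(c(
    "##fileformat=VCFv4.2",
    paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
            "INFO", "FORMAT", "SAMPLE_A", "SAMPLE_B"), collapse = "\t"),
    paste(c("1", "100", "rs1", "A", "G", ".", "PASS",
            ".", "ES", "0.1", "0.2"), collapse = "\t")
), toy)
ids <- sample_ids_from_header(toy)
```

Because this only looks at the file contents, it doesn't depend on where the file came from.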

bschilder commented 3 years ago

I also tried to cover a lot of potential scenarios with some new tests for write_sumstats: https://github.com/neurogenomics/MungeSumstats/blob/bschilder_dev/tests/testthat/test-write_sumstats.R