MRCIEU / opengwas-reports

Report module for IEU GWAS pipeline
1 stars 0 forks source link

Instead of requiring metadata.json, just generate it from the bcf header #1

Closed explodecomputer closed 5 years ago

explodecomputer commented 5 years ago

Suggested example script to do this (e.g. called header_to_json.r:

library(jsonlite)
library(dplyr)

header_to_json <- function(filename)
{
    tf <- tempfile()
    cmd <- paste0("bcftools view -h ", filename, " | grep -E '^##args|^##counts' | cut -d '#' -f 3 > ", tf)
    system(cmd)
    a <- scan(tf, what=character(), sep="\n")

    o <- lapply(c("args", "counts"), function(x){
        x1 <- grep(paste0("^", x, "."), a, value=TRUE) %>% gsub(paste0("^", x, "."), "", .) %>% strsplit(., split="=") %>% do.call("rbind", .)
        x2 <- as.list(x1[,2])
        names(x2) <- x1[,1]
        x2 %>% as.data.frame(., stringsAsFactors=FALSE)
    })
    names(o) <- c("args", "counts")
    o$counts <- lapply(o[["counts"]], as.numeric) %>% as.data.frame %>% unbox
    o %>% toJSON
    write_json(o, path=gsub("bcf$", "json", filename))
}

header_to_json(commandArgs(T)[1])

To run:

Rscript header_to_json harmonised.bcf

This creates harmonised.json (overwrites, beware). Output:

{
  "args": [
    {
      "dbsnp_field": "1",
      "delimiter": "\t",
      "ea_af_field": "4",
      "ea_field": "2",
      "effect_field": "5",
      "gwas_file": "../studies/2/elastic.gz",
      "gzipped": "1",
      "mrbase_id": "2",
      "n_field": "8",
      "nea_field": "3",
      "out": "../studies/2/harmonised",
      "out_type": "bcf",
      "pval_field": "7",
      "ref_build": "b37",
      "ref_file": "../../vcf-reference-datasets/1000g/1kg_v3_nomult.bcf",
      "ref_info": "1000 genomes phase 3 variants generated using https://github.com/MRCIEU/vcf-reference-datasets",
      "se_field": "6",
      "skip": "0"
    }
  ],
  "counts": {
    "total_variants": 2555510,
    "variants_not_read": 14,
    "variants_with_missing_stats": 0,
    "variants_with_missing_pvals": 0,
    "switched_alleles": 1210530,
    "flipped_alleles_basic": 6783,
    "flipped_alleles_palindrome": 0,
    "incompatible_alleles": 8978,
    "harmonised_variants": 2527179,
    "variants_not_harmonised": 8978,
    "total_remaining_variants": 2527179
  }
}

Requires the bcftools binary on the PATH.

YiLiu6240 commented 5 years ago

@explodecomputer the sections {args, counts} are not present in the harmonised.bcf example that I have (MRC-IEU-research/projects/IEU2/P4/013/working/data/bcfs/harmonised.bcf) -- is there a newer version of this example file?

These headers are not present in the files from @elswob as well.

explodecomputer commented 5 years ago

Thanks @YiLiu6240

I will discuss with Matt how to ensure that we have the same headers through the two different systems. But in the meantime perhaps you could generate like this instead

header_to_json <- function(filename)
{
    tf <- tempfile()
    cmd <- paste0("bcftools view -h ", filename, " | grep -E '^##' | cut -d '#' -f 3 > ", tf)
    system(cmd)
    a <- scan(tf, what=character(), sep="\n")

    x1 <- regmatches(a, regexpr("=", a), invert = TRUE) %>% do.call("rbind", .)
    x2 <- as.list(x1[,2])
    names(x2) <- x1[,1]
    x2 <- as.data.frame(x2, stringsAsFactors=FALSE)
    x2 <- lapply(x2, function(x)
    {
        o <- as.numeric(x)
        ifelse(is.na(o), x, o)
    }) %>% as.data.frame %>% unbox
    x2 %>% toJSON
    write_json(o, path=gsub("bcf$", "json", filename))
}

I'm not sure if the report module is dependent on any specific parameters but if there is anything then do let me know!

YiLiu6240 commented 5 years ago

In the current version we extract header from the bcf file to a json which is then read by rmarkdown (and can be used by other aggregation scripts).