Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Populate INFO column with VariantAnnotation::writeVcf #61

Open Al-Murphy opened 2 years ago

Al-Murphy commented 2 years ago

Hi ,

I'm the creator of the MungeSumstats Bioconductor package in which we use VariantAnnotation::writeVcf as an output options for the formatted sumstats. I want to align the formatting of our sumstats to IEU OpenGWAS VCF, part of which means adding the RSIDs to the INFO column in the VCF. However, I can't seem to populate INFO with VariantAnnotation::writeVcf. Is there any way to do this?

An example:

#get example data 
sumstats_dt <- MungeSumstats::formatted_example()
#convert to VRanges
gr <- to_granges(sumstats_dt)
message("Converting summary statistics to VRanges.")
gr$dummy <- "GWAS"
vr <- VariantAnnotation::makeVRangesFromGRanges(gr,
    ref.field = "A1", 
    alt.field = "A2",
    keep.extra.columns = TRUE,
    sampleNames.field = "dummy"
)
#make SNP field into INFO
vr$INFO <- paste0("RSID=",vr$SNP)
vr$SNP <- NULL
#Output as VCF
VariantAnnotation::writeVcf(
            obj = vr,
            ### Must supply filename without compression suffix
            filename ="~/Downloads/test_vcf.vcf",
            index = FALSE
        )
#inspect
print(readLines("~/Downloads/test_vcf.vcf",100))
#INFO not populated
vjcitn commented 2 years ago

Have you tried running your example, but preceding this with

debug(VariantAnnotation:::.makeVcfMatrix) # note the :::

to get clues as to why the INFO data are lost? Please consider making a PR if you find the answer.

Al-Murphy commented 2 years ago

Hey Vince!

Tried to look into this but can't seem to install the 1.99.3+ version of Rhtslib on mac, have you come across this issue?

But from what I can tell without getting into the details of Variantannotation, when writeVcf is called on a VRanges object I believe it's the below function the one that's called from here:

setMethod(writeVcf, c("VCF", "connection"),
    function(obj, filename, index = FALSE, ...)
{
    if (!isTRUEorFALSE(index))
        stop("'index' must be TRUE or FALSE")

    if (!isOpen(filename)) {
        open(filename)
        on.exit(close(filename))
    }

    scon <- summary(filename)
    headerNeeded <- !(file.exists(scon$description) &&
                      file.info(scon$description)$size !=0) 

    if (headerNeeded) {
        hdr <- .makeVcfHeader(obj)
        writeLines(hdr, filename)
    }

    if (index)
        obj <- sort(obj)

    if (all(is.na(idx <- .chunkIndex(dim(obj)[1L], ...))))
        .makeVcfMatrix(filename, obj)
    else
        for (i in idx)
            .makeVcfMatrix(filename, obj[i])
    flush(filename)

    if (index) {
        filenameGZ <- bgzip(scon$description, overwrite = TRUE)
        indexTabix(filenameGZ, format = "vcf")
        unlink(scon$description)
        invisible(filenameGZ)
    } else {
        invisible(scon$description)
    }
})

Then the call VariantAnnotation:::.makeVcfMatrix uses the line INFO <- .makeVcfInfo(info(obj), length(rd)) to get the info field where the function info() is defined as:

setMethod("info", "VCFHeader", 
    function(x) 
{
    info <- slot(x, "header")$INFO
    if (is.null(info))
        info <- DataFrame(Number=integer(), Type=character(),
                          Description=character())
    info
})

The issue is that VRanges don't have a slot(x, "header") value so I don't think it would ever pick it up? Unless I misinterpreted the function calls here?

Al-Murphy commented 2 years ago

Wait I figured a way around it:

VariantAnnotation::writeVcf(
  obj = VariantAnnotation::asVCF(vr,info='RSID'),
  ### Must supply filename without compression suffix
  filename ="~/Downloads/test_vcf.vcf",
  index = FALSE
)

Just converting to VCF class yourself and specifying the column to go into info in the writeVcf call works

vjcitn commented 2 years ago

Thank you for persevering. If you see a documentation enhancement that will prevent other users from hitting your problem please make a PR. I think we should leave this open for now.