GenomicsDB / GenomicsDB-R

Experimental R bindings to the native GenomicsDB library
GNU General Public License v2.0

Example code for reading GATK4 GenomicsDBImport #12

Open · lima1 opened this issue 4 years ago

lima1 commented 4 years ago

Hi Nalini,

I was playing around with this package. Thanks a lot for providing these R bindings! Much appreciated.

Do you have by any chance an example for iterating through a GATK4 GenomicsDBImport database?

I'm trying to build something similar to CreateSomaticPanelOfNormals, which tries to fit beta-binomial distributions to all heterozygous variants. So it comes down to simply extracting the AD field for all variants and all samples.
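
For reference, this is roughly the kind of fit I mean, as a minimal method-of-moments sketch on placeholder REF/ALT count vectors (not the actual CreateSomaticPanelOfNormals code):

# minimal sketch: moment estimates of beta parameters from allelic
# fractions; ad_ref/ad_alt are placeholder per-sample filtered counts
# at heterozygous sites
fit_beta_binomial <- function(ad_ref, ad_alt) {
    f <- ad_alt / (ad_ref + ad_alt)  # allelic fractions
    m <- mean(f)
    v <- var(f)
    # for Beta(alpha, beta): var = m * (1 - m) / (alpha + beta + 1);
    # this ignores the binomial sampling part, so it is only a rough start
    s <- m * (1 - m) / v - 1
    c(alpha = m * s, beta = (1 - m) * s)
}

fit_beta_binomial(ad_ref = c(60, 55, 70), ad_alt = c(52, 48, 61))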

I'll figure it out if you don't have anything in place, but I thought I'd ask since it seems like a fairly standard task.

Thanks a lot, Markus

lima1 commented 4 years ago

The new query_variant_calls function is nice and does almost all I need. This is the code I came up with:

library(genomicsdb)
library(jsonlite)

reference <- "references/hg38/Reference/sequence.fa"
workspace <- "gatk4_pon_db/"

# connect to the GenomicsDBImport workspace; the last argument is the
# vector of attributes to fetch
db <- genomicsdb::connect(workspace = workspace,
    vid_mapping_file = file.path(workspace, "vidmap.json"),
    callset_mapping_file = file.path(workspace, "callset.json"),
    reference_genome = reference,
    c("DP", "AD", "AF"))

# callset.json maps samples to their row indices in the array
j <- jsonlite::read_json(file.path(workspace, "callset.json"))

# query the chr1 array across all samples; GATK names its arrays
# <contig>$<start>$<end>
query <- query_variant_calls(db,
    array = "1$3682058$243843487",
    column_ranges = list(c(3682058, 243843487)),
    row_ranges = list(range(sapply(j$callsets, function(x) x$row_idx))))
head(query)
  ROW     COL        SAMPLE CHROM     POS     END REF  AD    AF ALT   DP
1   1 3682335 LIB-020359pc4     1 3682336 3682336   G 617 0.430 [A] 1105
2   4 3682335 LIB-020362pc4     1 3682336 3682336   G 626 0.445 [A] 1166
3   8 3682335 LIB-020366pc4     1 3682336 3682336   G 475 0.433 [A]  891
4   9 3682335 LIB-020367pc4     1 3682336 3682336   G 766 0.325 [A] 1191

One issue I have: Mutect2 generates a DP field in both FORMAT and INFO, one unfiltered and one filtered. It looks like the unfiltered one is picked. That's fine, but AD is a vector holding the filtered REF and ALT read counts, and only the first element (the REF count) makes it into the data.frame. That's a bug, right? As it is, I cannot get the filtered ALT count out of the data.frame.

And a quick question: is there an easy way to cycle through the available arrays and columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but I'm wondering if there is a cleaner way than the manual batching sketched below.
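
For concreteness, this is the kind of manual batching I have in mind (an untested sketch; coordinates and array name are taken from the chr1 query above):

# untested sketch: split a column range into fixed-size batches and
# query each batch separately
batch_ranges <- function(from, to, size = 1e5) {
    starts <- seq(from, to, by = size)
    lapply(starts, function(s) c(s, min(s + size - 1, to)))
}

batches <- lapply(batch_ranges(3682058, 243843487), function(r)
    query_variant_calls(db, array = "1$3682058$243843487",
        column_ranges = list(r),
        row_ranges = list(range(sapply(j$callsets, function(x) x$row_idx)))))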

Best, Markus

Update: iterating through all arrays:

    library(data.table)
    library(futile.logger)
    library(GenomeInfoDb)  # rankSeqlevels

    workspace <- normalizePath(workspace, mustWork = TRUE)
    db <- genomicsdb::connect(workspace = workspace,
        vid_mapping_file = file.path(workspace, "vidmap.json"),
        callset_mapping_file = file.path(workspace, "callset.json"),
        reference_genome = reference,
        c("DP", "AD", "AF"))

    jcallset <- jsonlite::read_json(file.path(workspace, "callset.json"))
    jvidmap <- jsonlite::read_json(file.path(workspace, "vidmap.json"))

    # get all available arrays (directories that contain genomicsdb_meta_dir)
    arrays <- sapply(dir(workspace, full.names = TRUE), file.path, "genomicsdb_meta_dir")
    arrays <- basename(names(arrays)[which(file.exists(arrays))])

    # recover the contig of each array from its <contig>$<start>$<end> name,
    # look up column offsets and lengths in vidmap.json, and put the
    # contigs in canonical order
    contigs <- sapply(arrays, function(ary) strsplit(ary, "\\$")[[1]][1])
    contigs <- jvidmap$contigs[match(contigs, sapply(jvidmap$contigs, function(x) x$name))]
    idx <- order(rankSeqlevels(sapply(contigs, function(x) x$name)))

    all <- lapply(idx, function(i) {
        # coerce to numeric to avoid integer overflow on large offsets
        c_offset <- as.numeric(contigs[[i]]$tiledb_column_offset)
        c_length <- as.numeric(contigs[[i]]$length)

        flog.info("Processing %s (offset %.0f, length %.0f)...",
            arrays[i], c_offset, c_length)
        # query the whole contig and return the result as a data.table
        data.table(genomicsdb::query_variant_calls(db,
            array = arrays[i],
            column_ranges = list(c(c_offset, c_offset + c_length)),
            row_ranges = list(range(sapply(jcallset$callsets,
                function(x) x$row_idx)))))
    })
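
The per-array tables can then be stacked into one table, e.g. with data.table::rbindlist(all).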
nalinigans commented 3 years ago

@lima1, sorry I was not watching this repository. Just wanted to let you know this is work in progress. Please let me know if you have solved your issue. Otherwise I will get back with answers soon.

> And quick question, is there an easy way to cycle through the available arrays + columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but wondering if there is a cleaner way.

Are you open to using spark or mpi or any parallel processing paradigm?
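
For example, since each array can be queried independently, something along these lines might already help (an untested sketch with the base parallel package; I am assuming here that a disconnect() counterpart to connect() exists and that one connection per forked worker is safe; arrays, contigs and jcallset are from your code above):

library(parallel)

# untested sketch: one forked worker per contig/array, each with its
# own connection
res <- mclapply(seq_along(arrays), function(i) {
    db <- genomicsdb::connect(workspace = workspace,
        vid_mapping_file = file.path(workspace, "vidmap.json"),
        callset_mapping_file = file.path(workspace, "callset.json"),
        reference_genome = reference,
        c("DP", "AD", "AF"))
    on.exit(genomicsdb::disconnect(db))  # assumed API, see note above
    c_offset <- as.numeric(contigs[[i]]$tiledb_column_offset)
    c_length <- as.numeric(contigs[[i]]$length)
    genomicsdb::query_variant_calls(db,
        array = arrays[i],
        column_ranges = list(c(c_offset, c_offset + c_length)),
        row_ranges = list(range(sapply(jcallset$callsets,
            function(x) x$row_idx))))
}, mc.cores = 4)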

lima1 commented 3 years ago

Thanks @nalinigans. The code works well enough for my purpose, but I would be curious to hear your suggestions once the code has matured! Thanks for all your work, it's really nice.