lima1 opened 4 years ago
The new query_variant_calls function is nice and does almost everything I need. This is the code I came up with:
library(genomicsdb)
library(jsonlite)

reference <- "references/hg38/Reference/sequence.fa"
workspace <- "gatk4_pon_db/"

db <- genomicsdb::connect(workspace = workspace,
  vid_mapping_file = file.path(workspace, "vidmap.json"),
  callset_mapping_file = file.path(workspace, "callset.json"),
  reference_genome = reference,
  c("DP", "AD", "AF"))  # attributes to query

j <- jsonlite::read_json(file.path(workspace, "callset.json"))
query <- query_variant_calls(db,
  array = "1$3682058$243843487",
  column_ranges = list(c(3682058, 243843487)),
  row_ranges = list(range(sapply(j$callsets, function(x) x$row_idx))))
head(query)
ROW COL SAMPLE CHROM POS END REF AD AF ALT DP
1 1 3682335 LIB-020359pc4 1 3682336 3682336 G 617 0.430 [A] 1105
2 4 3682335 LIB-020362pc4 1 3682336 3682336 G 626 0.445 [A] 1166
3 8 3682335 LIB-020366pc4 1 3682336 3682336 G 475 0.433 [A] 891
4 9 3682335 LIB-020367pc4 1 3682336 3682336 G 766 0.325 [A] 1191
One issue I have is that Mutect2 generates a DP field in both FORMAT and INFO, one unfiltered and one filtered, and it looks like the unfiltered one is picked. That's fine, but AD is a vector holding the filtered REF and ALT read counts, and only the REF count makes it into the data.frame. That's a bug, right? As it stands, I cannot get the filtered ALT count from the data.frame.
And a quick question: is there an easy way to cycle through the available arrays and columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but I'm wondering if there is a cleaner way.
Best, Markus
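(One way to batch the column space, as asked above: pre-compute fixed-size column ranges for a contig and feed each one to `column_ranges`. This is a sketch; `make_batches` is a hypothetical helper, not part of genomicsdb.)

```r
# Sketch: split a contig's TileDB column range (offset, length) into
# fixed-size batches. Each element can be passed as one entry of
# `column_ranges` in query_variant_calls().
make_batches <- function(offset, len, batch_size = 100000) {
  starts <- seq(offset, offset + len - 1, by = batch_size)
  lapply(starts, function(s) c(s, min(s + batch_size - 1, offset + len - 1)))
}

# e.g., for each batch b of a contig:
#   query_variant_calls(db, array = arr, column_ranges = list(b), row_ranges = rows)
```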
Update: iterate through all arrays:
library(genomicsdb)
library(jsonlite)
library(data.table)
library(futile.logger)   # for flog.info()
library(GenomeInfoDb)    # for rankSeqlevels()

workspace <- normalizePath(workspace, mustWork = TRUE)
db <- genomicsdb::connect(workspace = workspace,
  vid_mapping_file = file.path(workspace, "vidmap.json"),
  callset_mapping_file = file.path(workspace, "callset.json"),
  reference_genome = reference,
  c("DP", "AD", "AF"))

jcallset <- jsonlite::read_json(file.path(workspace, "callset.json"))
jvidmap <- jsonlite::read_json(file.path(workspace, "vidmap.json"))

# get all available arrays (directories containing genomicsdb_meta_dir)
arrays <- sapply(dir(workspace, full.names = TRUE), file.path, "genomicsdb_meta_dir")
arrays <- basename(names(arrays)[which(file.exists(arrays))])

# look up contig offsets and lengths in the vid mapping
contigs <- sapply(arrays, function(ary) strsplit(ary, "\\$")[[1]][1])
contigs <- jvidmap$contigs[match(contigs, sapply(jvidmap$contigs, function(x) x$name))]

# process contigs in karyotype order
idx <- order(rankSeqlevels(sapply(contigs, function(x) x$name)))
all <- lapply(idx, function(i) {
  # coerce to numeric to avoid integer overflow on large offsets
  c_offset <- as.numeric(contigs[[i]]$tiledb_column_offset)
  c_length <- as.numeric(contigs[[i]]$length)
  flog.info("Processing %s (offset %.0f, length %.0f)...",
            arrays[i], c_offset, c_length)
  data.table(genomicsdb::query_variant_calls(db,
    array = arrays[i],
    column_ranges = list(c(c_offset, c_offset + c_length)),
    row_ranges = list(range(sapply(jcallset$callsets,
                                   function(x) x$row_idx)))))
})
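(The per-array tables from the loop above can then be stacked into a single table. A minimal sketch with toy stand-ins for the per-array query results:)

```r
library(data.table)

# Toy stand-ins for two per-array query results (same idea as the
# data.tables produced by the lapply() loop above).
per_array <- list(
  data.table(CHROM = "1", POS = c(3682336, 3690001), AD = c(617, 475)),
  data.table(CHROM = "2", POS = 150000, AD = 320)
)

# Stack into one table; fill = TRUE tolerates arrays missing optional fields.
combined <- rbindlist(per_array, fill = TRUE)
setkey(combined, CHROM, POS)   # sort by genomic position for downstream use
```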
@lima1, sorry, I was not watching this repository. Just wanted to let you know this is a work in progress. Please let me know if you have solved your issue; otherwise I will get back with answers soon.
> And a quick question: is there an easy way to cycle through the available arrays and columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but I'm wondering if there is a cleaner way.
Are you open to using spark or mpi or any parallel processing paradigm?
Thanks @nalinigans. The code works well enough for my purpose, but I would be curious to hear your suggestions once the code has matured! Thanks for all your work, it’s really nice.
Hi Nalini,
I was playing around with this package. Thanks a lot for providing these R bindings! Much appreciated.
Do you have by any chance an example for iterating through a GATK4 GenomicsDBImport database?
I'm trying to build something similar to CreateSomaticPanelOfNormals which tries to fit beta-binomial distributions to all heterozygous variants. So simply extracting the AD field for all variants and all samples.
I'll figure it out if you don't have anything in place, but I thought I'd ask as it seems like a fairly standard task.
Thanks a lot, Markus
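(For the beta-binomial idea above, a minimal method-of-moments starting point, my sketch rather than what CreateSomaticPanelOfNormals does: treat the per-sample allele fractions at a site as draws from a Beta distribution and match mean and variance. A full beta-binomial fit would also use the per-sample depths.)

```r
# Method-of-moments fit of a Beta distribution to per-sample allele
# fractions at one site.
fit_beta_mom <- function(af) {
  m <- mean(af)
  v <- var(af)
  stopifnot(v > 0, v < m * (1 - m))   # MoM estimates only valid in this regime
  k <- m * (1 - m) / v - 1
  c(alpha = m * k, beta = (1 - m) * k)
}
```

For example, once AF has been extracted for the heterozygous calls at a site, something like `fit_beta_mom(site$AF)` gives per-site shape parameters.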