Closed pdiakumis closed 3 years ago
Hi @pdiakumis,
I have managed to reproduce the error (see modified script below). It seems that your vcf already contained a called DBS at chr5:1295113-1295114.
The recursive function assumes that DBS variants have been called as neighboring SNVs and merges them into DBS/MBS variants. This specific DBS was already called as such and since it neighbored a SNV, it was merged into a 3-base long MBS.
If your pipeline already identifies DBS/MBS variants, then you can use the predefined_dbs_mbs
argument to prevent neighouring SNVs from being merged into DBS/MBS variants. (This argument is probably not yet in the current bioconductor release, but is available on the github version).
Interestingly, this variant on chromosome 5 is the only variant already called as a DBS. Which is quite a low number if your pipeline is calling DBS variants. If your pipeline normally doesn't identify DBS/MBS variants then it would be interesting to know how it managed to call this one variant.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg38"
library(ref_genome, character.only = TRUE)
load_all("~/surfdrive/Shared/Boxtel_General/Scripts/Git_submission/Freek_MutationalPatterns/MutationalPatterns/")
setwd("~/Downloads/")
# Read vcf
vcf <- "COLO829_snv.vcf.gz"
gr <- MutationalPatterns::read_vcfs_as_granges(
vcf_files = vcf,
sample_names = "COLO829",
genome = ref_genome,
group = "auto+sex",
type = "all")
gr = gr[[1]]
# Get mut type
gr_dbs <- MutationalPatterns::get_mut_type(vcf_list = gr, type = "dbs")
table(sapply(gr_dbs$REF, function(x) as.character(x)))
# Find variants causing issue.
bad_region = gr_dbs[ elementNROWS(gr_dbs$REF) != 2]
hits = findOverlaps(bad_region, gr)
gr[subjectHits(hits)]
# How many DBS are present in the gr before merging dbs?
which(elementNROWS(gr$REF) == 2 & elementNROWS(unlist(gr$ALT)) == 2)
Thanks for looking into this! It's an ensemble calling approach so unless at least two callers agree on the individual SNVs and them merge them in the same way you are not doing to see DBS in the final call set; I suspect that is part of the reason why these are rare. Relatively sure we don't do a variant normalisation / merging after the ensemble step.
Okay, in that case I guess you could remove any DBS/MBS variants before using get_mut_type
and them add them back in later. This should be pretty easy to do with something like:
dbs_i = elementNROWS(gr$REF) == 2 & elementNROWS(unlist(gr$ALT)) == 2
dbs_predetermined_gr = gr[dbs_i]
other_gr = gr[!dbs_i]
dbs_merged_gr = get_mut_type(other_gr, type = "dbs")
dbs_gr = c(dbs_merged_gr, dbs_predetermined_gr)
You will probably need to extend this to mbs though and add a check that ensures dbs_i isn't empty.
@FreekManders thanks very much for the snippet! I'll close this issue since it is data/pipeline-specific from our side. Cheers - Peter
Thanks for the great package!
We have encountered an issue with one of our COLO829 test samples. It seems that after running
get_mut_type(gr, type = "dbs")
, it keeps aGGG
in the REF column. This then becomes problematic when hitting the.count_dbs_contexts_gr
function in https://github.com/UMCUGenetics/MutationalPatterns/blob/master/R/count_dbs_contexts.R#L106:I'm not sure if there's an issue with the
.split_mbs_gr
recursive function. I'm attaching the raw VCF file in case you have time to reproduce (there's 39,023 mutations in there so it takes a little bit to run (~1-2min). I'll disable this dbs step in our pipeline for the time being.COLO829_snv.vcf.gz