UMCUGenetics / MutationalPatterns

R package for extracting and visualizing mutational patterns in base substitution catalogues
MIT License
104 stars 45 forks source link

Error: There are some REF or ALT bases, that do not belong to any of the categories. #56

Closed pdiakumis closed 3 years ago

pdiakumis commented 3 years ago

Thanks for the great package!

We have encountered an issue with one of our COLO829 test samples. It seems that after running get_mut_type(gr, type = "dbs"), it keeps a GGG in the REF column. This then becomes problematic when hitting the .count_dbs_contexts_gr function in https://github.com/UMCUGenetics/MutationalPatterns/blob/master/R/count_dbs_contexts.R#L106:

require(MutationalPatterns)
ref_genome <- "BSgenome.Hsapiens.UCSC.hg38"
require(ref_genome, character.only = TRUE)

vcf <- "COLO829_snv.vcf.gz"
gr <- MutationalPatterns::read_vcfs_as_granges(
  vcf_files = vcf,
  sample_names = "COLO829",
  genome = ref_genome,
  group = "auto+sex",
  type = "all")

gr_dbs <- MutationalPatterns::get_mut_type(vcf_list = gr, type = "dbs")
table(sapply(gr_dbs$COLO829$REF, function(x) as.character(x)))

## AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG GGG  GT  TA  TC  TG  TT
##  9  20  27   6   9 261   1  18  17   2 300   1  18   3  26   7  13

gr_dbs <- MutationalPatterns::get_dbs_context(vcf_list = gr_dbs)
dbs_counts <- MutationalPatterns::count_dbs_contexts(vcf_list = gr_dbs)

## Error: There are some REF or ALT bases, that do not belong to any of the categories.
## Did you forget to use 'get_dbs_context()'?

I'm not sure if there's an issue with the .split_mbs_gr recursive function. I'm attaching the raw VCF file in case you have time to reproduce (there's 39,023 mutations in there so it takes a little bit to run (~1-2min). I'll disable this dbs step in our pipeline for the time being.

COLO829_snv.vcf.gz

FreekManders commented 3 years ago

Hi @pdiakumis,

I have managed to reproduce the error (see modified script below). It seems that your vcf already contained a called DBS at chr5:1295113-1295114. The recursive function assumes that DBS variants have been called as neighboring SNVs and merges them into DBS/MBS variants. This specific DBS was already called as such and since it neighbored a SNV, it was merged into a 3-base long MBS. If your pipeline already identifies DBS/MBS variants, then you can use the predefined_dbs_mbs argument to prevent neighouring SNVs from being merged into DBS/MBS variants. (This argument is probably not yet in the current bioconductor release, but is available on the github version).

Interestingly, this variant on chromosome 5 is the only variant already called as a DBS. Which is quite a low number if your pipeline is calling DBS variants. If your pipeline normally doesn't identify DBS/MBS variants then it would be interesting to know how it managed to call this one variant.

ref_genome <- "BSgenome.Hsapiens.UCSC.hg38"
library(ref_genome, character.only = TRUE)
load_all("~/surfdrive/Shared/Boxtel_General/Scripts/Git_submission/Freek_MutationalPatterns/MutationalPatterns/")
setwd("~/Downloads/")

# Read vcf
vcf <- "COLO829_snv.vcf.gz"
gr <- MutationalPatterns::read_vcfs_as_granges(
    vcf_files = vcf,
    sample_names = "COLO829",
    genome = ref_genome,
    group = "auto+sex",
    type = "all")
gr = gr[[1]]

# Get mut type
gr_dbs <- MutationalPatterns::get_mut_type(vcf_list = gr, type = "dbs")
table(sapply(gr_dbs$REF, function(x) as.character(x)))

# Find variants causing issue.
bad_region = gr_dbs[ elementNROWS(gr_dbs$REF) != 2]
hits = findOverlaps(bad_region, gr)
gr[subjectHits(hits)]

# How many DBS are present in the gr before merging dbs?
which(elementNROWS(gr$REF) == 2 & elementNROWS(unlist(gr$ALT)) == 2)
ohofmann commented 3 years ago

Thanks for looking into this! It's an ensemble calling approach so unless at least two callers agree on the individual SNVs and them merge them in the same way you are not doing to see DBS in the final call set; I suspect that is part of the reason why these are rare. Relatively sure we don't do a variant normalisation / merging after the ensemble step.

FreekManders commented 3 years ago

Okay, in that case I guess you could remove any DBS/MBS variants before using get_mut_type and them add them back in later. This should be pretty easy to do with something like:

dbs_i = elementNROWS(gr$REF) == 2 & elementNROWS(unlist(gr$ALT)) == 2
dbs_predetermined_gr = gr[dbs_i]
other_gr = gr[!dbs_i]
dbs_merged_gr = get_mut_type(other_gr, type = "dbs")
dbs_gr = c(dbs_merged_gr, dbs_predetermined_gr)

You will probably need to extend this to mbs though and add a check that ensures dbs_i isn't empty.

pdiakumis commented 3 years ago

@FreekManders thanks very much for the snippet! I'll close this issue since it is data/pipeline-specific from our side. Cheers - Peter