UMCUGenetics / MutationalPatterns

R package for extracting and visualizing mutational patterns in base substitution catalogues
MIT License
104 stars 45 forks source link

Duplicate variants #66

Open mcdoz2010 opened 3 years ago

mcdoz2010 commented 3 years ago

Hello there, I would like to use Mutational Patterns on Illumina data from a DNA virus genome amplified from clinical samples, in order to identify the mutational processes that generate new virus variants in vivo. The sequencing depth is about 10 000x and I use Lofreq to generate the vcf files. In this case, there can be two mutations at the same site, but when the vcf files are loaded into a GRangesList, I get the following error message;

There were 2 duplicated variants in vcf file: .vcf They have been filtered out.

Is there any way I can disable this filtering step?

Thanks for a great package by the way!

Dorian

FreekManders commented 3 years ago

This normally can't be turned off. Here is a version of the reading functions that perform this filtering step, with the filter removed. If you load these functions in R, it should probably work. (I can imagine a function like plot_rainfall might have some issues though.) Can you confirm if this works?


read_vcfs_as_granges <- function(vcf_files,
                                 sample_names,
                                 genome,
                                 group = c("auto+sex", "auto", "sex", "circular", "all", "none"),
                                 type = c("snv", "indel", "dbs", "mbs", "all"),
                                 change_seqnames = TRUE,
                                 predefined_dbs_mbs = FALSE) {

  # Match arguments
  type <- match.arg(type)
  group <- match.arg(group)

  # Check sample names
  if (length(vcf_files) != length(sample_names)) {
    stop("Please provide the same number of sample names as VCF files", call. = FALSE)
  }

  # Get the reference genome
  tryCatch(
    error = function(cnd) {
      stop("Please provide the name of a BSgenome object.", call. = FALSE)
    },
    {
      genome <- BSgenome::getBSgenome(genome)
    }
  )

  # Read vcfs
  grl <- purrr::map(vcf_files, .read_single_vcf_as_grange, genome, group, change_seqnames) %>%
    GenomicRanges::GRangesList()

  # Filter for mutation type
  if (type != "all") {
    grl <- get_mut_type(grl, type, predefined_dbs_mbs)
  }

  # Set the provided names for the samples.
  names(grl) <- sample_names

  return(grl)
}

.read_single_vcf_as_grange <- function(vcf_file, genome, group, change_seqnames) {

  # Use VariantAnnotation's readVcf, but only store the
  # GRanges information in memory.  This speeds up the
  # loading significantly.
  # Muffle the warning about duplicate keys.
  withCallingHandlers(
    {
      gr <- GenomicRanges::granges(VariantAnnotation::readVcf(vcf_file))
    },
    warning = function(w) {
      if (grepl("duplicate keys in header will be forced to unique rownames", conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )

  # Throw a warning when a file is empty.
  # Return a empty GR, to prevent errors with changing the seqlevels.
  if (!length(gr)) {
    warning(paste0(
      "There were 0 variants (before filtering) found in the vcf file: ", vcf_file,
      "\nYou might want to remove this sample from your analysis."
    ), call. = FALSE)
    return(gr)
  }

  # Convert to a single chromosome naming standard.
  if (change_seqnames == TRUE) {
    tryCatch(
      error = function(cnd) {
        message(conditionMessage(cnd))
        stop("The seqlevelStyle of the vcf could not be changed to that of the reference.
                     You can run this function with `change_seqnames = F` and `group = 'none'`, 
                     to prevent this error.
                     However, you then have to make sure that the seqnames (chromosome names) are
                     the same between your vcfs and the reference BSgenome object.
                     (The message of the internal error causing this problem is shown above.)",
          call. = FALSE
        )
      },
      {
        GenomeInfoDb::seqlevelsStyle(gr) <- GenomeInfoDb::seqlevelsStyle(genome)[1]
      }
    )
  }

  # Change the genome name of the granges
  genome_name <- GenomeInfoDb::genome(genome)[[1]]
  GenomeInfoDb::genome(gr) <- genome_name

  # Filter for variants with the correct seqlevels
  if (group != "none") {
    tryCatch(
      error = function(cnd) {
        message(conditionMessage(cnd))
        stop("The vcf could not be filtered for the specific seqlevels group.
                     You can run this function with `group = 'none'`, to prevent this error.
                     (The message of the internal error causing this problem is shown above.)",
          call. = FALSE
        )
      },
      {
        gr <- .filter_seqlevels(gr, group, genome)
      }
    )
  }

  return(gr)
}

.filter_seqlevels <- function(gr, group, genome) {
  groups <- c()
  # These variables are needed to extract the possible seqlevels
  ref_style <- GenomeInfoDb::seqlevelsStyle(genome)
  ref_organism <- GenomeInfoDb::organism(genome)

  if (group == "auto+sex") {
    groups <- c(
      GenomeInfoDb::extractSeqlevelsByGroup(
        species = ref_organism,
        style = ref_style,
        group = "auto"
      ),
      GenomeInfoDb::extractSeqlevelsByGroup(
        species = ref_organism,
        style = ref_style,
        group = "sex"
      )
    )

    # In some cases, the seqlevelsStyle returns multiple styles.
    # In this case, we need to do a little more work to extract
    # a vector of seqlevels from it.
    groups_names <- names(groups)
    if (!is.null(groups_names)) {
      # The seqlevels in the groups are now duplicated.
      # The following code deduplicates the list items, so that
      # creating a data frame will work as expected.
      unique_names <- unique(groups_names)
      groups <- lapply(unique_names, function(x) groups[groups_names == x])
      groups <- lapply(groups, unlist, recursive = FALSE)

      # In case there are multiple styles applied, we only use the first.
      groups <- unique(as.vector(groups[[1]]))
    }
  }
  else {
    groups <- GenomeInfoDb::extractSeqlevelsByGroup(
      species = ref_organism,
      style = ref_style,
      group = group
    )
    groups <- unique(as.vector(t(groups)))
  }

  # The provided VCF files may not contain all chromosomes that are
  # available in the reference genome.  Therefore, we only take the
  # chromosomes that are actually available in the VCF file,
  # belonging to the filter group.
  groups <- BiocGenerics::intersect(groups, GenomeInfoDb::seqlevels(gr))

  # We use 'pruning.mode = "tidy"' to minimize the deleterious effect
  # on variants, yet, remove all variants that aren't in the filter
  # group.  By default, keepSeqlevels would produce an error.
  gr <- GenomeInfoDb::keepSeqlevels(gr, groups, pruning.mode = "tidy")

  return(gr)
}
mcdoz2010 commented 3 years ago

Needed library(magrittr), otherwise the pipe operator wasn't recognized, but it seems to work - the error message doesn't come up any more. Thanks!

FreekManders commented 3 years ago

Glad to hear it works. Please let me know if you run into any further issues.

mcdoz2010 commented 3 years ago

Well my main issue is being a bench biologist screwing around with bioinformatics I don't really understand, but I guess you won't be able to fix that problem!

gevro commented 2 years ago

Hi, I also have the same issue in a different context. Can this simply be added as an option for the read_vcfs_as_granges function?

i.e. a skipduplicatefiltering = c(TRUE,FALSE) option.

FreekManders commented 2 years ago

I'm a bit worried that some functions might have issues with duplicate variants. Could you use the version posted above and test if it works properly with functions like plot_rainfall? If it works well then I can probably add such an option.

gevro commented 2 years ago

Just tested and it works fine.

FreekManders commented 2 years ago

Ok, I now added an option to skip the duplicate filtering. Do note that this can cause plot_rainfall to throw a warning, because the distance between two mutations can become zero, which becomes -Inf after a log transformation. This warning can be safely ignored. Additionally, when identifying DBSs, if a duplicated variant is part of a DBS, then only a single DBS will be formed. I doubt whether this will ever be a problem in practice, but it might be good to keep in mind.

gevro commented 2 years ago

Thanks. How do I update the package with this? Has it been pushed as a new version?

FreekManders commented 2 years ago

Official bioconductor releases are only twice a year, but you can install the package from github with this: https://www.rdocumentation.org/packages/devtools/versions/1.13.6/topics/install_github

FreekManders commented 2 years ago

You can also load the package as described in the README. This will load the newest version of the package as if it is installed, but it won't actually overwrite your currently installed version.

gevro commented 2 years ago

I'm getting an error when I use install_github. It is requiring R 4.2.0. But there is no R 4.2.0... The latest R release is 4.1.2.

> install_github("UMCUGenetics/MutationalPatterns")
Downloading GitHub repo UMCUGenetics/MutationalPatterns@HEAD
Skipping 26 packages ahead of CRAN: zlibbioc, DelayedArray, MatrixGenerics, Rhtslib, BiocParallel, SummarizedExperiment, GenomicAlignments, Rsamtools, GenomeInfoDbData, BiocFileCache, KEGGREST, Biobase, biomaRt, rtracklayer, BiocIO, Biostrings, XVector, AnnotationDbi, GenomicRanges, GenomeInfoDb, IRanges, S4Vectors, BiocGenerics, GenomicFeatures, BSgenome, VariantAnnotation
✔  checking for file ‘/tmp/RtmpjxYpLI/remotes9a2f74904891/UMCUGenetics-MutationalPatterns-cc2687c/DESCRIPTION’ ... OK
─  preparing ‘MutationalPatterns’:
✔  checking DESCRIPTION meta-information ... OK
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Omitted ‘LazyData’ from DESCRIPTION
─  building ‘MutationalPatterns_3.5.2.tar.gz’
   Warning in sprintf(gettext(fmt, domain = domain), ...) :
     one argument not used by format 'invalid uid value replaced by that for user 'nobody''
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning in sprintf(gettext(fmt, domain = domain), ...) :
     one argument not used by format 'invalid gid value replaced by that for user 'nobody''
   Warning: invalid gid value replaced by that for user 'nobody'

Installing package into ‘/gpfs/data/elab/bin/Gilad/R/libs’
(as ‘lib’ is unspecified)
ERROR: this R is version 4.1.1, package 'MutationalPatterns' requires R  >= 4.2.0
* removing ‘/gpfs/data/elab/bin/Gilad/R/libs/MutationalPatterns’
Warning message:
In i.p(...) :
  installation of package ‘/tmp/RtmpjxYpLI/file9a2f15a415ec/MutationalPatterns_3.5.2.tar.gz’ had non-zero exit status
FreekManders commented 2 years ago

Oh, right. The development version of bioconductor requires that the development version of R is used, which is currently 4.2. When the next official bioconductor release of the package happens, then R4.2 should be out.

For now, you can just change the required R version in the DESCRIPTION file to an earlier version (line 62). You can download the repository from github, change the DESCRIPTION file and then install from source (https://stackoverflow.com/questions/1474081/how-do-i-install-an-r-package-from-source) or use devtools::load_all()