Duplicate variants - Githubissues

mcdoz2010 commented 3 years ago

Hello there, I would like to use Mutational Patterns on Illumina data from a DNA virus genome amplified from clinical samples, in order to identify the mutational processes that generate new virus variants in vivo. The sequencing depth is about 10 000x and I use Lofreq to generate the vcf files. In this case, there can be two mutations at the same site, but when the vcf files are loaded into a GRangesList, I get the following error message;

There were 2 duplicated variants in vcf file: .vcf They have been filtered out.

Is there any way I can disable this filtering step?

Thanks for a great package by the way!

Dorian

FreekManders commented 3 years ago

This normally can't be turned off. Here is a version of the reading functions that perform this filtering step, with the filter removed. If you load these functions in R, it should probably work. (I can imagine a function like plot_rainfall might have some issues though.) Can you confirm if this works?


read_vcfs_as_granges <- function(vcf_files,
                                 sample_names,
                                 genome,
                                 group = c("auto+sex", "auto", "sex", "circular", "all", "none"),
                                 type = c("snv", "indel", "dbs", "mbs", "all"),
                                 change_seqnames = TRUE,
                                 predefined_dbs_mbs = FALSE) {

  # Match arguments
  type <- match.arg(type)
  group <- match.arg(group)

  # Check sample names
  if (length(vcf_files) != length(sample_names)) {
    stop("Please provide the same number of sample names as VCF files", call. = FALSE)
  }

  # Get the reference genome
  tryCatch(
    error = function(cnd) {
      stop("Please provide the name of a BSgenome object.", call. = FALSE)
    },
    {
      genome <- BSgenome::getBSgenome(genome)
    }
  )

  # Read vcfs
  grl <- purrr::map(vcf_files, .read_single_vcf_as_grange, genome, group, change_seqnames) %>%
    GenomicRanges::GRangesList()

  # Filter for mutation type
  if (type != "all") {
    grl <- get_mut_type(grl, type, predefined_dbs_mbs)
  }

  # Set the provided names for the samples.
  names(grl) <- sample_names

  return(grl)
}

.read_single_vcf_as_grange <- function(vcf_file, genome, group, change_seqnames) {

  # Use VariantAnnotation's readVcf, but only store the
  # GRanges information in memory.  This speeds up the
  # loading significantly.
  # Muffle the warning about duplicate keys.
  withCallingHandlers(
    {
      gr <- GenomicRanges::granges(VariantAnnotation::readVcf(vcf_file))
    },
    warning = function(w) {
      if (grepl("duplicate keys in header will be forced to unique rownames", conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )

  # Throw a warning when a file is empty.
  # Return a empty GR, to prevent errors with changing the seqlevels.
  if (!length(gr)) {
    warning(paste0(
      "There were 0 variants (before filtering) found in the vcf file: ", vcf_file,
      "\nYou might want to remove this sample from your analysis."
    ), call. = FALSE)
    return(gr)
  }

  # Convert to a single chromosome naming standard.
  if (change_seqnames == TRUE) {
    tryCatch(
      error = function(cnd) {
        message(conditionMessage(cnd))
        stop("The seqlevelStyle of the vcf could not be changed to that of the reference.
                     You can run this function with `change_seqnames = F` and `group = 'none'`, 
                     to prevent this error.
                     However, you then have to make sure that the seqnames (chromosome names) are
                     the same between your vcfs and the reference BSgenome object.
                     (The message of the internal error causing this problem is shown above.)",
          call. = FALSE
        )
      },
      {
        GenomeInfoDb::seqlevelsStyle(gr) <- GenomeInfoDb::seqlevelsStyle(genome)[1]
      }
    )
  }

  # Change the genome name of the granges
  genome_name <- GenomeInfoDb::genome(genome)[[1]]
  GenomeInfoDb::genome(gr) <- genome_name

  # Filter for variants with the correct seqlevels
  if (group != "none") {
    tryCatch(
      error = function(cnd) {
        message(conditionMessage(cnd))
        stop("The vcf could not be filtered for the specific seqlevels group.
                     You can run this function with `group = 'none'`, to prevent this error.
                     (The message of the internal error causing this problem is shown above.)",
          call. = FALSE
        )
      },
      {
        gr <- .filter_seqlevels(gr, group, genome)
      }
    )
  }

  return(gr)
}

.filter_seqlevels <- function(gr, group, genome) {
  groups <- c()
  # These variables are needed to extract the possible seqlevels
  ref_style <- GenomeInfoDb::seqlevelsStyle(genome)
  ref_organism <- GenomeInfoDb::organism(genome)

  if (group == "auto+sex") {
    groups <- c(
      GenomeInfoDb::extractSeqlevelsByGroup(
        species = ref_organism,
        style = ref_style,
        group = "auto"
      ),
      GenomeInfoDb::extractSeqlevelsByGroup(
        species = ref_organism,
        style = ref_style,
        group = "sex"
      )
    )

    # In some cases, the seqlevelsStyle returns multiple styles.
    # In this case, we need to do a little more work to extract
    # a vector of seqlevels from it.
    groups_names <- names(groups)
    if (!is.null(groups_names)) {
      # The seqlevels in the groups are now duplicated.
      # The following code deduplicates the list items, so that
      # creating a data frame will work as expected.
      unique_names <- unique(groups_names)
      groups <- lapply(unique_names, function(x) groups[groups_names == x])
      groups <- lapply(groups, unlist, recursive = FALSE)

      # In case there are multiple styles applied, we only use the first.
      groups <- unique(as.vector(groups[[1]]))
    }
  }
  else {
    groups <- GenomeInfoDb::extractSeqlevelsByGroup(
      species = ref_organism,
      style = ref_style,
      group = group
    )
    groups <- unique(as.vector(t(groups)))
  }

  # The provided VCF files may not contain all chromosomes that are
  # available in the reference genome.  Therefore, we only take the
  # chromosomes that are actually available in the VCF file,
  # belonging to the filter group.
  groups <- BiocGenerics::intersect(groups, GenomeInfoDb::seqlevels(gr))

  # We use 'pruning.mode = "tidy"' to minimize the deleterious effect
  # on variants, yet, remove all variants that aren't in the filter
  # group.  By default, keepSeqlevels would produce an error.
  gr <- GenomeInfoDb::keepSeqlevels(gr, groups, pruning.mode = "tidy")

  return(gr)
}

mcdoz2010 commented 3 years ago

Needed library(magrittr), otherwise the pipe operator wasn't recognized, but it seems to work - the error message doesn't come up any more. Thanks!

FreekManders commented 3 years ago

Glad to hear it works. Please let me know if you run into any further issues.

mcdoz2010 commented 3 years ago

Well my main issue is being a bench biologist screwing around with bioinformatics I don't really understand, but I guess you won't be able to fix that problem!

gevro commented 2 years ago

Hi, I also have the same issue in a different context. Can this simply be added as an option for the read_vcfs_as_granges function?

i.e. a skipduplicatefiltering = c(TRUE,FALSE) option.