PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0
68 stars 15 forks source link

Sample Names not given in simplebed file #10

Closed nikitagambhir closed 5 years ago

nikitagambhir commented 6 years ago

I ran GRIDSS on 11 samples, 3 of which are control and the rest are treatments. I would like to find which samples have what SV's. Sample information is given in the metadata in the vcf but I cannot trace back which sample had what variant. I am using the following script.

library(VariantAnnotation)
library(stringr)
library(devtools)
library(rtracklayer)
library(StructuralVariantAnnotation)

# Simple SV type classifier
simpleEventType <- function(gr) {
  return(ifelse(seqnames(gr) != seqnames(partner(gr)), "ITX", # inter-chromosomosal
          ifelse(gr$insLen >= abs(gr$svLen) * 0.7, "INS",
           ifelse(strand(gr) == strand(partner(gr)), "INV",
            ifelse(xor(start(gr) < start(partner(gr)), strand(gr) == "-"), "DEL",
             "DUP")))))
}

vcf_iso_1 <- readVcf("/Users/nikitagambhir/Desktop/Research_files/iso_1.sv.vcf", "genome.fasta")
gr_all_iso_1 <- breakpointRanges(vcf_iso_1)
svtype_1 <- simpleEventType(gr_all_iso_1)
info(vcf_iso_1)$SIMPLE_TYPE <- NA_character_
info(vcf_iso_1[gr_all_iso_1$vcfId])$SIMPLE_TYPE <- svtype_1
info(vcf_iso_1[gr_all_iso_1$vcfId])$SVLEN <- gr_all_iso_1$svLen
gr_all_iso_1@metadata <- vcf_iso_1@metadata
writeVcf(vcf_iso_1, "vcf_iso_1.sv.annotated.vcf")

gr_all_iso_1 <- gr_all_iso_1[gr_all_iso_1$FILTER == "." & partner(gr_all_iso_1)$FILTER == "."] # Remove low confidence calls

simplegr_iso_1 <- gr_all_iso_1[simpleEventType(gr_all_iso_1) %in% c("INS", "INV", "DEL", "DUP")]
simplebed_iso_1 <- data.frame(
    chrom=seqnames(simplegr_iso_1),
    start=as.integer((start(simplegr_iso_1) + end(simplegr_iso_1)) / 2),
    end=as.integer((start(partner(simplegr_iso_1)) + end(partner(simplegr_iso_1))) / 2),
    name=simpleEventType(simplegr_iso_1),
    score=simplegr_iso_1$QUAL,
    strand="."
    )
# Just the lower of the two breakends so we don't output everything twice
simplebed_iso_1 <- simplebed_iso_1[simplebed_iso_1$start < simplebed_iso_1$end,]

Attached is the screenshot of simplebed_iso_1

screen shot 2018-05-21 at 12 31 10 pm

It would be great if you could help!

d-cameron commented 6 years ago

If you're working in R (the screenshot indicates you're working in RStudio), then I recommend working directly with the GRanges objects returned from breakpointRanges(). As long you you keep track of the vcfid column in the Granges object, you can pull out all the sample information you need using geno(vcf[gr$vcfid]).

For example, the per-sample variant quality score can be extracted using geno(vcf[gr$vcfid])$QUAL.

nikitagambhir commented 6 years ago

It works, thank you! I have two follow up questions.

  1. It appears that for a particular SV, the quality scores across all samples are added to obtain the final quality score. Although my samples have low quality scores individually, the final score is high enough to pass it through the quality check. So, a sample with zero quality score will not be identified. Should it be modified to call SVs on individual samples?

  2. The expected coverage of each of my samples was 30X, but I found an average coverage of only 15X. Is this sufficient enough for GRIDSS to perform a de novo assembly? Will I still be able to find SVs?

d-cameron commented 5 years ago

Recent GRIDSS releases now have a per-sample QUAL breakdown. QUAL=0 means that SV does not occur in that sample (assuming cross-contamination at a sufficiently low level).