PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0
68 stars 15 forks source link

̦Problem extracting partner QUAL score from breakpointRange #11

Closed andrwb closed 5 years ago

andrwb commented 5 years ago

I am reading a vcf file generated from gridss-2.0.1-gridss-jar-with-dependencies.jar with a single normal and a tumour sample, the file is read using:

full_vcf = readVcf(vcf_file, "hg19") #(package:VariantAnnotation) bpgr = breakpointRanges(full_vcf)

I am able to view the range set and see the overall QUAL score using:

bpgr

I am then running a command which extracts the QUAL score for vcfId and partner. This works (returning list of QUAL scores) as long as gridss samples loaded in the range file have 'o' or 'h' at the end of the id, but as soon as the data includes samples with a 'b' (which appear split into bp1 and bp2) I get the error message below:

Eg.

geno(full_vcf[bpgr$vcfId])$QUAL sample1 sample2 gridss0_78814b 0.00 0.00 gridss0_78814b 0.00 0.00 gridss0_4721o 476.27 354.68 gridss0_4721h 498.31 363.91 gridss0_16398o 302.49 553.14 gridss0_16398h 274.45 674.59 geno(full_vcf[bpgr$partner])$QUAL Error in SummarizedExperiment:::.SummarizedExperiment.charbound(i, rownames(x), :

[i,] index out of bounds: gridss0_78814b_bp2 gridss0_78814b_bp1

I was wondering if you could advise if this a bug, or if I should be processing the 'b' files differently?

I've attached a sample with 6 entires which fails, but if you remove the last sample, it seems to work gridss.test.txt

Thanks, Andrew

d-cameron commented 5 years ago

There's an upstream issue (https://github.com/Bioconductor/VariantAnnotation/issues/8 ) with processing single breakends. In the meantime, make sure you're using the latest version of StructuralVariantAnnotation and I've used this as the work-around:

readVcf = function(file, ...) {
  raw_vcf = VariantAnnotation::readVcf(file=file, ...)
  # work-around for https://github.com/Bioconductor/VariantAnnotation/issues/8
  alt = read_tsv(file, comment="#", col_names=c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", seq_len(ncol(geno(raw_vcf)[[1]]))), cols_only(ALT=col_character()))$ALT
  VariantAnnotation::fixed(raw_vcf)$ALT = CharacterList(lapply(as.character(alt), function(x) x))
  return(raw_vcf)
}
andrwb commented 5 years ago

Thanks for quick reply - unfortunately, on my example, the workaround doesn't appear to fix the problem - in both cases ALT is class character, and normal/workaround appear the same. Any ideas?

According to sessionInfo, i'm using StructuralVariantAnnotation_0.5.6 but it points to the project with description file 0.7.1, and install_github suggests it is up to date & hash hasn't changed since last install.

d-cameron commented 5 years ago

but as soon as the data includes samples with a 'b' (which appear split into bp1 and bp2)

I can't reproduce this behaviour with your input file locally. I suspect you're using an older version of the package. You need to update StructuralVariantAnnotation to a recent version that supports calls in single breakend notation.

d-cameron commented 5 years ago

You should be able to run breakendRanges(full_vcf) to extract the single breakends (these don't have a partner). If the above function doesn't exist, you're definitely running old version of the package.

andrwb commented 5 years ago

Got it working - thanks for the help