PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
255 stars 71 forks source link

Error in using StructuralVariantAnnotation #107

Closed ytguojian closed 6 years ago

ytguojian commented 6 years ago

Hi, I have used the StructuralVariantAnnotation package to infer the SV type such as DEL DUP et al, but it failed. The R script like this simpleEventType <- function(gr) { return(ifelse(seqnames(gr) != seqnames(partner(gr)), "ITX", # inter-chromosomosal ifelse(gr$insLen >= abs(gr$svLen) * 0.7, "INS", ifelse(strand(gr) == strand(partner(gr)), "INV", ifelse(xor(start(gr) < start(partner(gr)), strand(gr) == "-"), "DEL", "DUP"))))) } vcf <- readVcf("9.sorted.vcf", "Prunus_persica_v2.0.a1_scaffolds") gr <- breakpointRanges(vcf) svtype <- simpleEventType(gr) info(vcf)$SIMPLE_TYPE <- NAcharacter info(vcf[gr$vcfId])$SIMPLE_TYPE <- svtype writeVcf(vcf, "9.sv.annotated.vcf") when I run it, it showed these information below.

using the example in the GRIDSS /example directory

Warning message: In .breakpointRanges(x, ...) : Removing 149 unpaired breakend variants gridss0_16172o, gridss0_20323o, gridss0_20599o, gridss0_7200o, gridss0_20861o, gridss0_12598o, gridss0_18969o, gridss1_20053o, gridss1_1106o, gridss1_7536o, gridss1_9073o, gridss1_9926o, gridss1_10242o, gridss1_21733o, gridss1_18303o, gridss1_3391o, gridss1_22275o, gridss2_16036o, gridss2_19975o, gridss2_1853o, gridss2_11459o, gridss2_21867o, gridss2_3046o, gridss2_13908o, gridss2_14652o, gridss2_14891o, gridss3_20o, gridss3_1649o, gridss3_16574o, gridss3_19652o, gridss3_19758h, gridss4_7434o, gridss4_18013o, gridss4_1937o, gridss4_1964o, gridss4_18617o, gridss4_11683o, gridss4_18911o, gridss5_20891o, gridss4_13437h, gridss5_4319o, gridss5_21343o, gridss5_21491o, gridss5_716o, gridss5_8232o, gridss5_1393o, gridss5_1537o, gridss5_2153h, gridss5_13001h, gridss6_16603o, gridss6_5458o, gridss6_21439o, gridss6_21459o, gridss6_17824o, gridss6_7166o, gridss6_18293o, gridss6_1112o, gridss6_1331o, gridss6_22474o, gridss6_10419h, gridss6_13426o, gridss6_20 [... truncated]

svtype <- simpleEventType(gr) info(vcf)$SIMPLE_TYPE <- NAcharacter Warning message: info fields with no header: SIMPLE_TYPE info(vcf[gr$vcfId])$SIMPLE_TYPE <- svtype Warning message: info fields with no header: SIMPLE_TYPE writeVcf(vcf, "9.sv.annotated.vcf")

The output file 9.sv.annotated.vcf almost the same as original file. Could you help me?

Cheers

d-cameron commented 6 years ago

Does your output file include a SIMPLE_TYPE INFO field annotation for your variants? If so, it looks like it completed successfully (albiet with warnings).

Warning message: In .breakpointRanges(x, ...) : Removing 149 unpaired breakend variants ...

These are likely to be variants that either : a) pass the minimum cut-off threshold at one breakend but not at the otherusually this is due to data inconsistencies. These can come from many sources including non-compliant aligner output or even GATK indel realignment (which modifies read alignments without updating the mate read with the new alignment coordinates/CIGAR). or b) the other side has been filtered out of the VCF. This usually happens when you subset the VCF to a specific range/chromosome (e.g. filter to only chr 9 event removes the partner breakend for all chr 9 inter chromosomal events)

ytguojian commented 6 years ago

Hi Daniel, I have used gridss.sh script to call variants and the output VCF file does not contain SIMPLE_TYPE INFO field. How should I do to annotate the breakpoints? I also tried to annotate them with CLOVE, while the CLOVE required BEDPE format file. I used svtools to convert VCF format to BEDPE format and run CLOVE java -jar /workspace/hrajxg/software/clove-0.17/clove-0.17-jar-with-dependencies.jar -i 9.bedpe BEDPE -b /workspace/hrajxg/bioinf_Prunus_ZFRI/06_picard/9.sorted.dedup.bam -c 25 10 -o 9.clone.vcf

It still fialed Cheers, Jian

d-cameron commented 6 years ago

I have used gridss.sh script to call variants and the output VCF file does not contain SIMPLE_TYPE INFO field. How should I do to annotate the breakpoints?

GRIDSS calls SVs in VCF breakends notation so this is to be expected.

The output file 9.sv.annotated.vcf almost the same as original file.

If you're running /example/simple-event-annotation.R then the output file should be basically the same as the input file but with a SIMPLE_TYPE INFO field added. Is this not the case?

If you're just interested in INS/DEL/INV/DUP events, then the bed output from /example/simple-event-annotation.R:41 would suffice.

ytguojian commented 6 years ago

Hi Daniel, Yes, I make it with /example/simple-event-annotation.R. Thanks a lot. I have another question about the INS type. Are the INS variants detected based on the reference genome and not the novel insert sequence? Because I notice that the insert size is very small. Cheers, Jian

d-cameron commented 6 years ago

Since I'm using a form of reference-guided assembly, insert size is limited by the library fragment length so the detection of only small insertions is expected.

The next version of GRIDSS (currently undergoing final internal testing on the https://github.com/PapenfussLab/gridss/tree/dev branch) will report dangling breakends as well as breakpoints. Large insertions will show up as two separate breakends but since I'm not (yet) doing de novo assembly, they won't be explicitly paired (although if there are two breakends at the same location but in different directions the odds of it being a large insertion are pretty good).

Part of this new capability was driven by our work in the clinical cancer space. Even if you don't know exactly what sequence has been inserted, knowing that there is a structural event at a particular locus is extremely useful clinical information. That said, breakend detection is broadly applicable, hence the inclusion into GRIDSS.

JuanmaMedina commented 5 years ago

Hello Daniel,

Thanks for your comments and quick answers. I have a couple of additional suggestions/comments:

the other side has been filtered out of the VCF. This usually happens when you subset the VCF to a specific range/chromosome

I guess it is then expected to obtain a warning about removed breakend variants when one runs the gridss.sh script blacklisting some regions, such as the provided ENCODE excludable-regions .bed file?

If you're just interested in INS/DEL/INV/DUP events, then the bed output from /example/simple-event-annotation.R:41 would suffice.

A limitation of this approach is that, when running multiple samples, one looses the association information between the SVs and the samples they are detected in. I am currently working on a modification of the script to bypass this issue and get the sample information in a new column of the .bed output.

Best, Juanma