PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
258 stars 71 forks source link

How to extract high-quality SVs from VCF file? #610

Closed dr-ashu-geno closed 1 year ago

dr-ashu-geno commented 1 year ago

Dear developers,

Thank you for developing the tool.

I have run the GRIDSS on my samples, and used simple-event-annotation.R script to annotate the SVs and extract the SVs that pass the filters. However, I don't know how to extract those high-quality SVs (in the BED file) from the VCF file.

Could you help me with that? is there any way to extract PASS SVs, and send them to a VCF file with the same information as the original VCF file? OR isn't it possible to output PASS SVs into a VCF file instead of a BED file since the beginning?

Thanks in advance, Best,

alexiswl commented 1 year ago

@dr-ashu-geno you could probably use bcftools to collect the pass files from the VCF file,

bcftools view -f PASS input.vcf > output.vcf

I've provided a link to a biostars question that may be of use - https://www.biostars.org/p/206488/#423609

dr-ashu-geno commented 1 year ago

Dear Alexis (@alexiswl)

Thank you so much for the quick reply. Actually, I had tried BCFtools before and I'm not sure if it is a good idea. For example, in my primary data for one of the samples, after using the R script and annotating the SV types, the following two variants are reported as two insertions:

Screen Shot 2023-01-14 at 16 54 00

if I extract PASSed SVs using R script and send them to the BED file, those two insertions will be reported as one (which I believe is correct); this is how they appear as one insertion in bed file:

Screen Shot 2023-01-14 at 16 54 50

But if I use bcftools to extract PASSed variants and send them to a VCF file with the command you mentioned, they are reported as two insertions in the VCF file:

Screen Shot 2023-01-14 at 16 55 29

This will result in the wrong number of SVs in each sample, right?

Is there a way to report the PASSed SVs in a VCF file instead of BED file using the same R script?

Thanks in advance, Best,

alexiswl commented 1 year ago

@dr-ashu-geno it may be worth showing what you are expecting in the final VCF? Do you expect the chr1/84006 and chr1/84007 VCF entries to be merged? If so, what would that VCF line look like?

dr-ashu-geno commented 1 year ago

Yes I expect them to be merged because due to having the same information and format, I believe they represent one SV (an insertion).

*If these two entries are not representing the same variant, why are they reported as one SV in the BED file made by the R script available on GRIDSS GitHub page? and if they are not the same, and as both have PASSed filters, why aren't both of them shown in BED file?

alexiswl commented 1 year ago

I think it's worth reading the output section https://github.com/PapenfussLab/gridss#output as to why the VCF is represented this way.

Note that although each record fully defines the call, the VCF format requires both breakends to be written as separate records.

In this case the VCF represents an insertion between the 84006 ('G') and 84007 ('A') on chromosome 1. An insertion is really just a tiny structural variant. The ALT shows that the insertion sequence is AAAGAAGAAAA.

One method could be filtering by the SVLEN tag, in this case the SVLEN value is 12, which is a very small structural variant and probably not what you're after. Filtering the VCF by choosing only entries with a much higher value may provide you with something of greater use.

Also is there a pattern in the ID field, where the o and h suffixes represent the start and end of a variant? Maybe you could select just the o's?

dr-ashu-geno commented 1 year ago

Dear @alexiswl

Thank you so much for the comments. Thank you for referring me to the output section. As I mentioned and it is also written in output section of the GitHub page, each variant is represented with 2 records regardless of the size (this 12-bp insertion was just an example I wanted to show you, all variants in all size ranges are represented with two records). I will extract one from each (those with "o" only).

Thanks once again, Best regards.