Extracting this information from .vcf file

beginner984 commented 5 years ago

Hi

I have called structural variants for tumour and matched normal samples with two tools, Delly and Manta. I now have .vcf files as output, from which I want to extract the information below, but a long time googling and even contacting the developers got me nowhere.

> head(sv_data)
     Sample Type CHROM_1    POS_1 CHROM_2     POS_2 Tumor_Varcount Tumor_Depth Normal_Depth Driver
1   CHC018T  BND    chr1   753734    chr8    327403             22          52          101   <NA>
2   CHC018T  BND    chr1   753747    chr8    327385             17          37          104   <NA>
3   CHC018T  BND    chr1 20344587    chr7   8707723             24         163           63   <NA>

Tumor_Varcount = the number of variant bases at each position in the tumour sample

This is my .vcf from Delly

https://www.dropbox.com/s/4buzm8n7tebbyki/delly.vcf?dl=0

This is my .vcf from Manta

https://www.dropbox.com/s/a2mjlmeb1in3pms/candidateSV.vcf?dl=0

Any help please

d-cameron commented 5 years ago

Your Dropbox links do not work.
You are asking for a single read depth field for a breakpoint, but a breakpoint has two read depths, one for each breakend. An SV can be homozygous at one breakend and heterozygous at the other.
- For example, a chr1-chr2 interchromosomal translocation with 0 other copies of chr1 and 1 reference copy of chr2.
There are no standard VCF SV fields for your counts. Look at the VCF header for each of your callers. For Delly, the code will look something like:
```
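library(VariantAnnotation)
vcf = readVcf("delly.vcf")
# Delly FORMAT fields (check the header of your own VCF to confirm):
ref_pairs = geno(vcf)$DR      # reference-supporting read pairs
variant_pairs = geno(vcf)$DV  # variant-supporting read pairs
ref_reads = geno(vcf)$RR      # reference-supporting junction reads
variant_reads = geno(vcf)$RV  # variant-supporting junction reads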
```

Technically, a fragment can be both a split read and a discordant read pair, and Delly will double-count these fragments. GRIDSS corrects for this in its VF field. I'm not sure about Manta.
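
For readers without Bioconductor at hand, the per-sample counts can also be read straight from a VCF record in base R. The following is only a minimal sketch: the record is a toy example with invented values, the sample order (normal first, tumour second) is an assumption, and the field names should be checked against your own header.

```r
# One toy Delly-style record (values invented for illustration only).
record <- paste(
  "chr1", "753734", "BND00000001", "N", "N[chr8:327403[", ".", "PASS",
  "SVTYPE=BND", "GT:DR:DV:RR:RV", "0/0:101:0:0:0", "0/1:30:12:20:10",
  sep = "\t"
)
sample_names <- c("normal", "tumour")  # assumed column order

# Split the record into its tab-separated VCF columns.
fields <- strsplit(record, "\t", fixed = TRUE)[[1]]

# Column 9 lists the FORMAT keys; columns 10 onwards hold one value set per sample.
keys <- strsplit(fields[9], ":", fixed = TRUE)[[1]]
samples <- fields[10:length(fields)]

# Build a samples x keys character matrix.
values <- do.call(rbind, strsplit(samples, ":", fixed = TRUE))
dimnames(values) <- list(sample_names, keys)

tumour_variant_pairs <- as.integer(values["tumour", "DV"])
tumour_variant_junction_reads <- as.integer(values["tumour", "RV"])
```

Simply adding the pair and junction-read counts together would be affected by the double-counting described above, so the two are kept separate here.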

Determining variant allele fraction from the read depth at the location is technically incorrect: you want the coverage across the SV position.
- This should also take any microhomology into account. GRIDSS is the only caller that I'm aware of that gets this correct.
SV positions can be uncertain (IMPRECISE VCF field). If you want to ignore all this, specify nominalPosition=TRUE when calling breakpointRanges().

d-cameron commented 5 years ago

Driver is difficult to determine. I presume this is a cancer cohort. For cancer cohort analysis, I use the Hartwig Medical pipeline, as it does SV + CN analysis as well as detecting gene fusions, gene disruptions, and driver predictions. We should have a preprint on this pipeline available in about a month, but the pipeline is already complete (and has been run successfully on a 4,000 patient WGS cohort).

d-cameron commented 5 years ago

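Start/end positions can be extracted with:

library(VariantAnnotation)
library(StructuralVariantAnnotation)
vcf = readVcf("manta.vcf")
bpgr = breakpointRanges(vcf)            # one GRanges entry per breakend
svbedpe_df = breakpointgr2bedpe(bpgr)   # BEDPE-style data frame with both ends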

Note that for SVs, you also need to specify the orientation of the SV at the start/end positions. In this package, this is encoded in the strand of the GRanges object.
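
To illustrate what the orientation means, here is a minimal base R sketch that derives the two strands from the bracket notation of a BND ALT allele as defined in the VCF specification. The helper name `bnd_orientation` is hypothetical and not part of this package, and the mapping to "+"/"-" reflects my understanding of the convention ("+" when the breakend lies after the given position, "-" when it lies before it), so check it against what breakpointRanges() returns for your own data.

```r
# Hypothetical helper: derive breakend orientations from a VCF BND ALT string,
# e.g. "N[chr8:327403[" or "]chr1:753734]N".
bnd_orientation <- function(alt) {
  pattern <- "^([^\\[\\]]*)([\\[\\]])(.+):([0-9]+)([\\[\\]])([^\\[\\]]*)$"
  m <- regmatches(alt, regexec(pattern, alt, perl = TRUE))[[1]]
  if (length(m) == 0) stop("not a BND ALT allele: ", alt)
  list(
    # Bases before the bracket: the local sequence comes first, so the
    # breakend lies after the local position ("+"); otherwise before it ("-").
    local_strand = if (nzchar(m[2])) "+" else "-",
    # "[" means the partner sequence extends to the right of its position, so
    # the breakend lies before it ("-"); "]" means it extends to the left ("+").
    partner_strand = if (m[3] == "[") "-" else "+",
    partner_chrom = m[4],
    partner_pos = as.integer(m[5])
  )
}

first <- bnd_orientation("N[chr8:327403[")
second <- bnd_orientation("]chr1:753734]N")
```

Tools that take strand columns as input may define strand differently, so it is worth confirming their convention before filling in such columns.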

d-cameron commented 5 years ago

Note that the latest version of Delly is not VCF compliant and does not write both sides of BND records as it is required to. I have not yet written code to handle these non-compliant SVs.

beginner984 commented 5 years ago

Thank you so much for being so helpful. The links to the .vcf files are working now.

Actually, I need to find mutational signatures from SV data, so I called the SVs, but I don't know how to extract the needed information from the .vcf files. Another tool, which does the mutational signature analysis, demands a format like this:

> head(svs.all[1:2,3:17])
  Type Chrom1    Start1      End1 Strand1 Chrom2   Start2     End2 Strand2 Score Filters Read.pairs...tumour
1  BND   chr3  25802405  25802858       +  chr22 36811461 36812184       +    41    PASS                   9
2  BND  chr10 128580463 128581087       +  chr18 25212086 25212571       +    44    PASS                   5
  Read.pairs...normal Split.reads...tumour Split.reads...normal
1                   0                    0                    0
2                   0                    0                    0
>

But I don't know how to get these columns from .VCF

LP2000104-DNA_A01_vs_LP2000101-DNA_A01.SVannotated.txt

d-cameron commented 5 years ago

> another tool does mutational signature demands such a format

Which tool is this?

> But I don't know how to get these columns from .VCF

Delly and Manta store this information in different fields. Open the VCF files with a text editor and have a look at the header definitions.

E.g:

manta_split_read_count = geno(manta_vcf)$SR[,2]
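
The header definitions can also be listed programmatically instead of opening the file by hand. This is a minimal base R sketch under stated assumptions: the header lines below are a toy example typed inline, and in practice they would come from something like readLines() on your own file.

```r
# Toy header lines for illustration; replace with readLines("your_file.vcf").
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##FORMAT=<ID=DR,Number=1,Type=Integer,Description=\"# high-quality reference pairs\">",
  "##FORMAT=<ID=DV,Number=1,Type=Integer,Description=\"# high-quality variant pairs\">",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOUR"
)

# Keep only the FORMAT definitions, which describe the per-sample fields.
format_lines <- grep("^##FORMAT=", vcf_lines, value = TRUE)

# Pull out the ID and the Description of each definition.
format_defs <- data.frame(
  id = sub("^##FORMAT=<ID=([^,]+),.*$", "\\1", format_lines),
  description = sub("^.*Description=\"(.*)\">$", "\\1", format_lines),
  stringsAsFactors = FALSE
)
print(format_defs)
```

The resulting table shows which field IDs to request from geno() for each caller.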

PapenfussLab / StructuralVariantAnnotation

Extracting this information from .vcf file #14