PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0

Extracting this information from .vcf file #14

Closed beginner984 closed 5 years ago

beginner984 commented 5 years ago

Hi

I have called structural variants for tumour and matched normal samples with two tools, Delly and Manta. I now have .vcf files as output, from which I want to extract the information shown below, but a long time spent googling, and even contacting the developers, has not helped.

> head(sv_data)
     Sample Type CHROM_1    POS_1 CHROM_2     POS_2 Tumor_Varcount Tumor_Depth Normal_Depth Driver
1   CHC018T  BND    chr1   753734    chr8    327403             22          52          101   <NA>
2   CHC018T  BND    chr1   753747    chr8    327385             17          37          104   <NA>
3   CHC018T  BND    chr1 20344587    chr7   8707723             24         163           63   <NA>

Tumor_Varcount = The number of variant bases at each position in tumour sample

This is my .vcf from delly

https://www.dropbox.com/s/4buzm8n7tebbyki/delly.vcf?dl=0

This is my .vcf from Manta

https://www.dropbox.com/s/a2mjlmeb1in3pms/candidateSV.vcf?dl=0

Any help please

d-cameron commented 5 years ago

Technically, a fragment can be both a split read and a discordant read pair, and Delly will double-count these fragments. GRIDSS corrects for this in its VF field. I'm not sure about Manta.

d-cameron commented 5 years ago

Driver is difficult to determine. I presume this is a cancer cohort. For cancer cohort analysis, I use the Hartwig Medical pipeline, as it does SV + CN analysis as well as detecting gene fusions, gene disruptions, and driver predictions. We should have a preprint on this pipeline available in about a month, but the pipeline is already complete (and has been run successfully on a 4,000 patient WGS cohort).

d-cameron commented 5 years ago

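Start/end positions can be extracted with:

library(VariantAnnotation)            # provides readVcf()
library(StructuralVariantAnnotation)  # provides breakpointRanges(), breakpointgr2bedpe()
vcf = readVcf("manta.vcf")
bpgr = breakpointRanges(vcf)
svbedpe_df = breakpointgr2bedpe(bpgr)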

Note that for SVs, you also need to specify the orientation of the SV at the start/end positions. In this package, this is encoded in the strand of the GRanges object.
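To make the orientation point concrete, here is a minimal base R sketch of how orientation can be read off a BND ALT string, following the bracket notation in the VCF specification. The package does this decoding for you and exposes the result through the strand of the GRanges object; `bnd_orientation` is a hypothetical helper written only for illustration, and the strand convention in the comments ("+" when the breakpoint lies immediately after the position, "-" when immediately before) is an assumption based on the package documentation. The four ALT strings are the example forms from the VCF specification, not records from the linked files.

```r
# Decode breakend orientation from VCF BND ALT strings (hypothetical helper).
# local_strand:   "+" if the ALT bases precede the bracket (breakpoint after POS),
#                 "-" if they follow it (breakpoint before POS).
# partner_strand: "+" for "]" brackets, "-" for "[" brackets.
bnd_orientation <- function(alt) {
  # groups: prefix bases, bracket, partner contig, partner position, bracket, suffix bases
  pattern <- "^([^][]*)([][])(.+):([0-9]+)([][])([^][]*)$"
  m <- regmatches(alt, regexec(pattern, alt))
  ok <- lengths(m) == 7
  if (!all(ok)) stop("not a BND ALT string: ", paste(alt[!ok], collapse = ", "))
  m <- do.call(rbind, m)  # one row per ALT string, 7 columns
  data.frame(
    alt            = alt,
    local_strand   = ifelse(nchar(m[, 2]) > 0, "+", "-"),
    partner_chrom  = m[, 4],
    partner_pos    = as.integer(m[, 5]),
    partner_strand = ifelse(m[, 3] == "]", "+", "-"),
    stringsAsFactors = FALSE
  )
}

# The four ALT forms given as examples in the VCF specification
res <- bnd_orientation(c("G]17:198982]", "]13:123456]T", "C[2:321682[", "[17:198983[A"))
print(res)
```

Two breakpoints between the same pair of positions but with different strand combinations describe different rearrangements, which is why positions alone are not enough.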

d-cameron commented 5 years ago

Note that the latest version of Delly is not VCF compliant and does not write both sides of BND records as it is required to. I have not yet written code to handle these non-compliant SVs.

beginner984 commented 5 years ago

Thank you so much for being so helpful. The links to the .vcf files are working now.

Actually, I need to find mutational signatures from SV data, so I called the SVs, but I don't know how to extract the needed information from the .vcf files; another tool that does mutational signatures demands such a format

> head(svs.all[1:2,3:17])
  Type Chrom1    Start1      End1 Strand1 Chrom2   Start2     End2 Strand2 Score Filters Read.pairs...tumour
1  BND   chr3  25802405  25802858       +  chr22 36811461 36812184       +    41    PASS                   9
2  BND  chr10 128580463 128581087       +  chr18 25212086 25212571       +    44    PASS                   5
  Read.pairs...normal Split.reads...tumour Split.reads...normal
1                   0                    0                    0
2                   0                    0                    0
>

But I don't know how to get these columns from .VCF

LP2000104-DNA_A01_vs_LP2000101-DNA_A01.SVannotated.txt

d-cameron commented 5 years ago

another tool that does mutational signatures demands such a format

Which tool is this?

But I don't know how to get these columns from .VCF

Delly and Manta store this information in different columns. Open the VCF files with a text editor and have a look at the header definitions.
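The header definitions can also be listed from R instead of a text editor. The following is a minimal base R sketch, assuming the usual `##FORMAT=<ID=...,Number=...,Type=...,Description="...">` layout. `vcf_format_fields` is a hypothetical helper, and the example header lines are illustrative stand-ins modelled on typical Manta and Delly headers, not copies from the linked files, so check them against your own VCFs.

```r
# List the per-sample (FORMAT) field definitions found in VCF header lines.
# `lines` is a character vector, e.g. from readLines("manta.vcf").
vcf_format_fields <- function(lines) {
  pattern <- '^##FORMAT=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">$'
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) == 5]  # keep only lines where all four groups matched
  data.frame(
    ID          = vapply(hits, `[`, character(1), 2),
    Number      = vapply(hits, `[`, character(1), 3),
    Type        = vapply(hits, `[`, character(1), 4),
    Description = vapply(hits, `[`, character(1), 5),
    stringsAsFactors = FALSE
  )
}

# Illustrative header lines (assumed; real files may word these differently)
example_header <- c(
  '##fileformat=VCFv4.1',
  '##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support for the ref and alt alleles in the order listed">',
  '##FORMAT=<ID=SR,Number=.,Type=Integer,Description="Split reads for the ref and alt alleles in the order listed">',
  '##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">',
  '##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">',
  '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR'
)
fields <- vcf_format_fields(example_header)
print(fields[, c("ID", "Number", "Type")])
```

On a real file the call would be `vcf_format_fields(readLines("manta.vcf"))`. The `Number` column matters when reading the values back: a field declared with `Number=1` holds one value per sample, while `Number=.` can hold several.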

E.g:

manta_split_read_count = geno(manta_vcf)$SR[,2]
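One caveat, hedged: in the Manta somatic VCFs I am aware of, SR is declared with `Number=.` and holds the reference count followed by the alternate count, so the expression above would give one (ref, alt) pair per variant rather than a single number. Assuming that layout, a small helper can pull out the alt count. `alt_support` is a hypothetical name, and the mock list below only stands in for what `geno(manta_vcf)$SR[,2]` is assumed to return; verify the shape on your own data.

```r
# Pull the alt-allele count out of per-variant (ref, alt) pairs.
# Returns NA where the field is missing or holds fewer than two values.
alt_support <- function(pairs) {
  vapply(pairs, function(x) {
    if (length(x) >= 2 && !is.na(x[2])) as.integer(x[2]) else NA_integer_
  }, integer(1))
}

# Mock data standing in for geno(manta_vcf)$SR[, 2]
# (assumed shape: a list with one integer vector per variant)
mock_sr <- list(c(30L, 9L), c(41L, 0L), NA_integer_)
tumour_split_reads <- alt_support(mock_sr)
print(tumour_split_reads)
```

The same helper would apply to Manta's PR field if it follows the same (ref, alt) layout. Delly's count fields are typically single values per sample, so no unpacking should be needed there.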