PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0
69 stars 14 forks source link

GRanges 'vcfid' and multisample VCF input #31

Closed brucemoran closed 4 years ago

brucemoran commented 4 years ago

Hi,

I have an issue with multisample VCF input to the package.

Best practice in GRIDSS specifies joint calling of all samples creating this input.

I have read a few other issues, #10 and #11 which reference a vcfid column in GRanges from breakpointRanges(vcf), these are older so understandable if things have changed.

This column is not extant, and there is no mention of extracting per-sample ranges in the documentation.

Can you please specify how I can get individual GRanges objects per-sample from a multisample VCF, with multiple tumour?

Thanks,

Bruce

d-cameron commented 4 years ago

This column is not extant

It was renamed to sourceId as StructuralVariantAnnotation supports loading from other sources as well (e.g. BEDPE).

there is no mention of extracting per-sample ranges in the documentation.

A VCF doesn't have per-sample events. It has an all-sample set of events, and a breakdown of the per-sample support for those events is defined in the FORMAT fields (i.e. geno(vcf)).

Can you please specify how I can get individual GRanges objects per-sample from a multisample VCF, with multiple tumour?

You get the GRanges for all calls, then subset it to only variants supported by that sample. How you define support depends on your use case and model. If you've got independent samples then you'd probably want use high confidence threshold used by grids_somatic_filter.R:

library(StructuralVariantAnnotation)
vcf = readVcf("colo829.vcf")
bpgr = breakpointRanges(vcf)
sample_name="COLO829T.bam"
sample_specific_bpgr = bpgr[geno(vcf[bpgr$sourceId])$QUAL[,sample_name] >= 350]

If your tumour samples are related (e.g. multiple mets), then you'll want a much lower threshold so you can look at subclones commons across samples. If you have 5% subclone in one met is found at 100% in another, you'd want to keep those 5% SVs even if they don't pass the high confidence threshold. In these cases, you might want a lower threshold of, say, 2 supporting fragments bpgr[geno(vcf[bpgr$sourceId])$VF[,"COLO829T.bam"] >= 2].

d-cameron commented 4 years ago

Another option is to use gridss_somatic_filter.R included in GRIDSS to convert the raw GRIDSS output into a somatic VCF. This can be done per-sample using the --tumourordinal command line parameter for that script.

brucemoran commented 4 years ago

Thank you very much Daniel for your response, I had indeed considered your second solution but glad to see how to separate out samples from the GRanges object.

And thanks also for your ideas on filtering when working on subclones, this is relevant for our work.

All the best, Bruce

d-cameron commented 4 years ago

In that case, some of the functions in the libgridss.R library that does the heavy lifting for gridss_somatic_filter.R may also be relevant to you. For example gridss_bp_af and gridss_be_af to calculate tumour variant allele fraction. For small events, fragments from the variant haplotype can entirely span the allele so they need a slightly different VAF calculation than large events.