Interpretation of WGS data (epilepsy): Where should I commence?

robertzeibich commented 1 year ago

I am interested in structural variants/copy number variants (SVs/CNVs) of whole genome sequencing data of patients with Epilepsy, but I do not know where to commence with the analysis once the output files have been generated:

Command: cnvkit.py batch /scratch/xm41/ct/bams/*bam -n --method wgs -t ${TARGETS} -f ${FASTA} --annotate ${ANNOT} --scatter --diagram

In which file can I find the SVs/CNVs? What is the best way to start analyzing my files?

Help would be much appreciated.

robertzeibich commented 1 year ago

I made some progress. I first used the provided cnvkit files, but then I thought of a more customized approach:

I created a target_chr.bed file, which includes the genomic coordinates of ca. 500 epilepsy genes.
I used the chop command of bedops to create bp baits of 1000 with no overlap (file: processed_target_chr.bed).
I executed the following command: cnvkit.py batch /scratch/xm41/ct/bams/${BAMFILE} -n --method wgs -t processed_target_chr.bed -f ${FASTA} --annotate target_chr.bed --scatter --diagram -d cnvkit_epilepsy_specific

For my first sample, I get the following scatter plot and pdf: GALF003-scatter

I wonder why the pdf has gain regions with no gene label. Do you know why? Based on my understanding, chr19 and chr21 should be further analyzed.

Please let me know when you think my approach has flaws or makes little sense.

I would highly appreciate a response/your support.

etal commented 1 year ago

In the scatter plot I see some datapoints supporting single-copy losses at the ends of chr1, chr6, and chr7, and possibly chr17 and chr21. It's suspicious that they're at the very end of each chromosome near the telomere, but still probably worth looking into unless you have another explanation why it might be an artifact.

For germline alterations you can focus your attention on log2 ratios close to -1.0 and ignore the ones around -0.5 or above unless you suspect your sample is mosaic.

CNVkit's plots aren't well suited to WGS, but since you've used bedtools to make these equivalent to targeted amplicon sequencing, the analysis commands that work on .cnr files, like genemetrics, can give useful information.

Internally, here's the handling of each sequencing method:

"hybrid" has both high-coverage targets and low-coverage antitargets (off-target regions).
"amplicon" has only high-coverage targets and no off-target reads, so there are no antitarget regions in the output. Hance, there can be huge genomic gaps between coverage datapoints, and while gene-level information is usable (see genemetrics), segmentation is prone to weird artifacts.
"wgs" has neither high-coverage targets nor low-coverage antitargets; it just scans each chromosome end-to-end and relies on gene annotations to extract usable information for reporting. It is not as well tested, and the segmentation methods and plots weren't really designed for this assay, but in principle it ought to work to some extent. For performance reasons, I recommend restricting the scanned regions to genic regions, as you did here.

If your list of genes of interest doesn't match your gene annotations, you're going to have a bad time. Consider checking separately that each gene of interest is in the gene annotations table, and where it isn't, find the matching synonym by looking it up in GeneCards or similar. This is a very normal and painful part of bioinformatics. If you have a list of synonyms you can automate it, but be careful not to swap names that aren't equivalent.

robertzeibich commented 1 year ago

It took me awhile to get back. I am following now a different approach (comparing the output against other SVs/CNVs methods):

BATCH cnvkit.py batch ${BAM} -n --method wgs -f ${FASTA} --target-avg-size 1000 --annotate refFlat_ensembl_chr.txt --processes $(nproc) --scatter --diagram -d ${outdir}
EXPORT VCF cnvkit.py export vcf ${outdir}/${SAMPLE}.analysis.ready.call.cns -i "${SAMPLE}" -o ${outdir}/${SAMPLE}.cnv.vcf

Do you recommend to filter the output VCFs and if yes, which filter parameters would you choose?

I am currently focusing on deletions and duplications.

I also noticed that cnvkit has a much higher min value for DELETIONS (see png) than other tools. How can I resolve that or is that something I should not worry about?

robertzeibich commented 1 year ago

I visualized the fold change vs length and fold change log vs length now (please see my previous response for the steps).

Can I use more than 0.5 fold change or greater than -1 fold change log as filter criteria?

etal commented 1 year ago

Yes, those are reasonable filters. Exporting to VCF from the .call.cns file should generally be OK if you're happy with the filters that went into the .call.cns. You might want to modify -t to ignore low-amplitude gains and losses since you're working with diploid germline/constitutional samples.

For epilepsy maybe take a look at Manta or Parliament2 which should be more sensitive to small-scale alterations and tandem repeats.

etal / cnvkit

Interpretation of WGS data (epilepsy): Where should I commence? #808