etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
545 stars 165 forks source link

Interpretation of WGS data (epilepsy): Where should I commence? #808

Open robertzeibich opened 1 year ago

robertzeibich commented 1 year ago

I am interested in structural variants/copy number variants (SVs/CNVs) of whole genome sequencing data of patients with Epilepsy, but I do not know where to commence with the analysis once the output files have been generated:

Command: cnvkit.py batch /scratch/xm41/ct/bams/*bam -n --method wgs -t ${TARGETS} -f ${FASTA} --annotate ${ANNOT} --scatter --diagram

image

In which file can I find the SVs/CNVs? What is the best way to start analyzing my files?

Help would be much appreciated.

robertzeibich commented 1 year ago

I made some progress. I first used the provided cnvkit files, but then I thought of a more customized approach:

  1. I created a target_chr.bed file, which includes the genomic coordinates of ca. 500 epilepsy genes.
  2. I used the chop command of bedops to create bp baits of 1000 with no overlap (file: processed_target_chr.bed).
  3. I executed the following command: cnvkit.py batch /scratch/xm41/ct/bams/${BAMFILE} -n --method wgs -t processed_target_chr.bed -f ${FASTA} --annotate target_chr.bed --scatter --diagram -d cnvkit_epilepsy_specific

For my first sample, I get the following scatter plot and pdf: GALF003-scatter

GALF003-diagram

I wonder why the pdf has gain regions with no gene label. Do you know why? Based on my understanding, chr19 and chr21 should be further analyzed.

Please let me know when you think my approach has flaws or makes little sense.

I would highly appreciate a response/your support.

etal commented 1 year ago

In the scatter plot I see some datapoints supporting single-copy losses at the ends of chr1, chr6, and chr7, and possibly chr17 and chr21. It's suspicious that they're at the very end of each chromosome near the telomere, but still probably worth looking into unless you have another explanation why it might be an artifact.

For germline alterations you can focus your attention on log2 ratios close to -1.0 and ignore the ones around -0.5 or above unless you suspect your sample is mosaic.

CNVkit's plots aren't well suited to WGS, but since you've used bedtools to make these equivalent to targeted amplicon sequencing, the analysis commands that work on .cnr files, like genemetrics, can give useful information.

Internally, here's the handling of each sequencing method:

If your list of genes of interest doesn't match your gene annotations, you're going to have a bad time. Consider checking separately that each gene of interest is in the gene annotations table, and where it isn't, find the matching synonym by looking it up in GeneCards or similar. This is a very normal and painful part of bioinformatics. If you have a list of synonyms you can automate it, but be careful not to swap names that aren't equivalent.

robertzeibich commented 1 year ago

It took me awhile to get back. I am following now a different approach (comparing the output against other SVs/CNVs methods):

  1. BATCH cnvkit.py batch ${BAM} -n --method wgs -f ${FASTA} --target-avg-size 1000 --annotate refFlat_ensembl_chr.txt --processes $(nproc) --scatter --diagram -d ${outdir}

  2. EXPORT VCF cnvkit.py export vcf ${outdir}/${SAMPLE}.analysis.ready.call.cns -i "${SAMPLE}" -o ${outdir}/${SAMPLE}.cnv.vcf

Do you recommend to filter the output VCFs and if yes, which filter parameters would you choose?

I am currently focusing on deletions and duplications.

image

I also noticed that cnvkit has a much higher min value for DELETIONS (see png) than other tools. How can I resolve that or is that something I should not worry about?

robertzeibich commented 1 year ago

I visualized the fold change vs length and fold change log vs length now (please see my previous response for the steps).

image

image

Can I use more than 0.5 fold change or greater than -1 fold change log as filter criteria?

etal commented 1 year ago

Yes, those are reasonable filters. Exporting to VCF from the .call.cns file should generally be OK if you're happy with the filters that went into the .call.cns. You might want to modify -t to ignore low-amplitude gains and losses since you're working with diploid germline/constitutional samples.

For epilepsy maybe take a look at Manta or Parliament2 which should be more sensitive to small-scale alterations and tandem repeats.