Open robertzeibich opened 1 year ago
I made some progress. I first used the provided cnvkit files, but then I thought of a more customized approach:
For my first sample, I get the following scatter plot and pdf:
I wonder why the pdf has gain regions with no gene label. Do you know why? Based on my understanding, chr19 and chr21 should be further analyzed.
Please let me know when you think my approach has flaws or makes little sense.
I would highly appreciate a response/your support.
In the scatter plot I see some datapoints supporting single-copy losses at the ends of chr1, chr6, and chr7, and possibly chr17 and chr21. It's suspicious that they're at the very end of each chromosome near the telomere, but still probably worth looking into unless you have another explanation why it might be an artifact.
For germline alterations you can focus your attention on log2 ratios close to -1.0 and ignore the ones around -0.5 or above unless you suspect your sample is mosaic.
CNVkit's plots aren't well suited to WGS, but since you've used bedtools to make these equivalent to targeted amplicon sequencing, the analysis commands that work on .cnr files, like genemetrics, can give useful information.
Internally, here's the handling of each sequencing method:
If your list of genes of interest doesn't match your gene annotations, you're going to have a bad time. Consider checking separately that each gene of interest is in the gene annotations table, and where it isn't, find the matching synonym by looking it up in GeneCards or similar. This is a very normal and painful part of bioinformatics. If you have a list of synonyms you can automate it, but be careful not to swap names that aren't equivalent.
It took me awhile to get back. I am following now a different approach (comparing the output against other SVs/CNVs methods):
BATCH
cnvkit.py batch ${BAM} -n --method wgs -f ${FASTA} --target-avg-size 1000 --annotate refFlat_ensembl_chr.txt --processes $(nproc) --scatter --diagram -d ${outdir}
EXPORT VCF
cnvkit.py export vcf ${outdir}/${SAMPLE}.analysis.ready.call.cns -i "${SAMPLE}" -o ${outdir}/${SAMPLE}.cnv.vcf
Do you recommend to filter the output VCFs and if yes, which filter parameters would you choose?
I am currently focusing on deletions and duplications.
I also noticed that cnvkit has a much higher min value for DELETIONS (see png) than other tools. How can I resolve that or is that something I should not worry about?
I visualized the fold change vs length and fold change log vs length now (please see my previous response for the steps).
Can I use more than 0.5 fold change or greater than -1 fold change log as filter criteria?
Yes, those are reasonable filters. Exporting to VCF from the .call.cns
file should generally be OK if you're happy with the filters that went into the .call.cns
. You might want to modify -t
to ignore low-amplitude gains and losses since you're working with diploid germline/constitutional samples.
For epilepsy maybe take a look at Manta or Parliament2 which should be more sensitive to small-scale alterations and tandem repeats.
I am interested in structural variants/copy number variants (SVs/CNVs) of whole genome sequencing data of patients with Epilepsy, but I do not know where to commence with the analysis once the output files have been generated:
Command: cnvkit.py batch /scratch/xm41/ct/bams/*bam -n --method wgs -t ${TARGETS} -f ${FASTA} --annotate ${ANNOT} --scatter --diagram
In which file can I find the SVs/CNVs? What is the best way to start analyzing my files?
Help would be much appreciated.