Closed ryan-ed-bailey closed 9 months ago
There is no reason why MoChA should drop the other half of the event. The only reason I can think of is that somehow all the variants in that region ended up in the .xcl.bcf
file. Without the file or more information I do not have any other good guess. Can you plot the event using mocha_plot.R --mocha
? That should give you a better understanding of what is going on
Thanks for the quick follow-up!
Here's the mocha_plot.R output:
The same region from a different sample:
The pBAF seems to be missing almost all the probes.
Here is the Chr12 xcl.vcf seems to contain many probes over this region, whereas other samples do not.
23MA003_output.xcl.chr12.vcf.csv
Can you clarify the reason why most probes across this region end up in the .xcl* file? What attributes force them to be excluded? What changes should be made to avoid this?
Cheers,
Ryan
You generated the .xcl.bcf
file. I have no idea what you did. My guess from your picture is that most heterozygous sites were not phased in the 12p= sample. Any idea why that would be the case?
As a follow-up. Indeed probes were ending up in .xcl.bcf. Here is the plot when we don't exclude any variants/probes:
The missingness (>3%) filter was the reason for these probes to be excluded, as they weren't called in a subset of the rest of the dataset. We are using SHAPEIT to phase the data, as per the README's directions.
I wanted to ask for clarification on the missingness threshold, especially for smaller datasets. We typically run 64 samples through the pipeline, with some redundancy in genotypes (the same cell line under different conditions). We are screening for mosaic CNVs that tend to arise in cell culture.
Cheers,
Ryan
If your cell lines have repeated events in the same locus at high cell fraction, this will cause the variants overlapping those events to incur a high rate of missing genotypes. If that is the case, you need to use a different QC strategy or a completely different missingness rate threshold. Of course, this will affect QC in a negative way in other parts of the genome but with only 64 samples you are unlikely to get a good QC anyway. My suggestion is to increase the missingness rate threshold until you get something working for you. From your last plot it seems like phasing worked okay. Given the small sample size I assume you phased including 1000 Genomes project samples as the reference panel
Thanks again for your prompt reply and assistance! We will work on assessing increasing the missingness threshold and assess the downstream effects on the data.
I have 2 remaining questions:
Thanks again! Ryan
With 64 samples if you have 2 missing genotypes for a given variant this will already give you a missingness rate 2/64>3% and using the recommended thresholds it will exclude that variant from further analysis. This will inevitably make you lose some good variants at sites with recurrent mCAs. As for cf
, as explained here, this is a function of bdev
and type
:
cf = 4 bdev / (1 + 2 bdev) for losses
cf = 2 bdev for CN-LOHs
cf = 4 bdev / (1 - 2 bdev) for gains
This assuming there are enough heterozygous sites to compute the bdev
statistic. Notice that at high cell fractions the heterozygous sites will start to be lost and this will inevitably bias the bdev
to smaller values
Thanks for the clarifications!
Apologies, but one last follow-up!
Can you please define the bdev
metric a bit further? Based on the README I understand the following:
bdev - BAF deviation estimate from 0.5
and
| 1/cn - 1/2 | = bdev
Despitecn
is not listed in the MoChA output, explicitly, I understand theoretically how the calculation can be made based on copy number (ie. use cn = 3, assuming a 100% gain) and piped into the CF formulas above. However, if you're able to define bdev
and how it's calculated specifically for a given CNV, we'd greatly appreciate it!
Thanks again,
Ryan
The bdev
is the best estimated deviation from 50% of the BAF for a given call. BAF values should be centered around 0.5
for a diploid locus but for a given event they are centered around 0.5+bdev
and 0.5-bdev
. That is what is triggering MoChA to make a call. After a call is made it then tries from LRR to estimate whether it is a gain, a loss, or a CN-LOH
Hi there,
Thanks for developing such a great tool!
My group is currently using MoChA to scan for mosaic CNVs in our cell cultures using data from our Illumina microarray. We have encountered some false negative calls (in which MoChA seems to miss a clear LOH signal) and I'm hoping you can clear up why MoChA seems to be truncating the calls.
We've implemented MoChA using the directions you outlined (including all the upstream steps). Here is the MoChA code we are implementing:
$BCFTOOLS/bcftools +mocha \ --genome $GENOME_NAME \ --no-version \ --output - \ --output-type b \ --variants ^${OUTPUT_PREFIX}.xcl.bcf \ --calls ${OUTPUT_PREFIX}.calls.tsv \ --stats ${OUTPUT_PREFIX}.stats.tsv \ --ucsc-bed ${OUTPUT_PREFIX}.ucsc.bed \ --mhc $MHC_REG \ --kir $KIR_REG \ --cnp $CNP \ --thread 12 \ ${OUTPUT_PREFIX}.bcf | \ tee ${OUTPUT_PREFIX}.as.bcf | \ $BCFTOOLS/bcftools index --force --output ${OUTPUT_PREFIX}.as.bcf.csi
MoChA detects an LOH signal on Chr12p, however the call seems to miss the half of the signal. There are not additional calls made on the Chr12 (no filtering applied to the callset).
Here are the BAF and LRR plots, the yellow box indicates what MoChA detected:
We are curious if there is any way to adjust MoChA so that it captures the entire LOH event, rather than just a portion. This has happened in 2 other cases, at the same loci in separate samples. Any clarifications would be greatly appreciated.
Cheers,
Ryan