freeseek / mocha

MOsaic CHromosomal Alterations (MoChA) caller
MIT License
81 stars 23 forks source link

Incomplete LOH call(s) on Chr12p #42

Closed ryan-ed-bailey closed 9 months ago

ryan-ed-bailey commented 10 months ago

Hi there,

Thanks for developing such a great tool!

My group is currently using MoChA to scan for mosaic CNVs in our cell cultures using data from our Illumina microarray. We have encountered some false negative calls (in which MoChA seems to miss a clear LOH signal) and I'm hoping you can clear up why MoChA seems to be truncating the calls.

We've implemented MoChA using the directions you outlined (including all the upstream steps). Here is the MoChA code we are implementing:

$BCFTOOLS/bcftools +mocha \ --genome $GENOME_NAME \ --no-version \ --output - \ --output-type b \ --variants ^${OUTPUT_PREFIX}.xcl.bcf \ --calls ${OUTPUT_PREFIX}.calls.tsv \ --stats ${OUTPUT_PREFIX}.stats.tsv \ --ucsc-bed ${OUTPUT_PREFIX}.ucsc.bed \ --mhc $MHC_REG \ --kir $KIR_REG \ --cnp $CNP \ --thread 12 \ ${OUTPUT_PREFIX}.bcf | \ tee ${OUTPUT_PREFIX}.as.bcf | \ $BCFTOOLS/bcftools index --force --output ${OUTPUT_PREFIX}.as.bcf.csi

MoChA detects an LOH signal on Chr12p, however the call seems to miss the half of the signal. There are not additional calls made on the Chr12 (no filtering applied to the callset).

Here are the BAF and LRR plots, the yellow box indicates what MoChA detected:

Chr12_LOH

We are curious if there is any way to adjust MoChA so that it captures the entire LOH event, rather than just a portion. This has happened in 2 other cases, at the same loci in separate samples. Any clarifications would be greatly appreciated.

Cheers,

Ryan

freeseek commented 10 months ago

There is no reason why MoChA should drop the other half of the event. The only reason I can think of is that somehow all the variants in that region ended up in the .xcl.bcf file. Without the file or more information I do not have any other good guess. Can you plot the event using mocha_plot.R --mocha? That should give you a better understanding of what is going on

ryan-ed-bailey commented 10 months ago

Thanks for the quick follow-up!

Here's the mocha_plot.R output: image

The same region from a different sample:

image

The pBAF seems to be missing almost all the probes.

Here is the Chr12 xcl.vcf seems to contain many probes over this region, whereas other samples do not.

23MA003_output.xcl.chr12.vcf.csv

Can you clarify the reason why most probes across this region end up in the .xcl* file? What attributes force them to be excluded? What changes should be made to avoid this?

Cheers,

Ryan

freeseek commented 10 months ago

You generated the .xcl.bcf file. I have no idea what you did. My guess from your picture is that most heterozygous sites were not phased in the 12p= sample. Any idea why that would be the case?

ryan-ed-bailey commented 9 months ago

As a follow-up. Indeed probes were ending up in .xcl.bcf. Here is the plot when we don't exclude any variants/probes:

image

The missingness (>3%) filter was the reason for these probes to be excluded, as they weren't called in a subset of the rest of the dataset. We are using SHAPEIT to phase the data, as per the README's directions.

I wanted to ask for clarification on the missingness threshold, especially for smaller datasets. We typically run 64 samples through the pipeline, with some redundancy in genotypes (the same cell line under different conditions). We are screening for mosaic CNVs that tend to arise in cell culture.

  1. Is it advisable to decrease the missingness threshold for smaller datasets (~64 samples)?
  2. For SHAPEIT (and subsequently MoChA), what is the consequence of skipping the variant exclusion step altogther?

Cheers,

Ryan

freeseek commented 9 months ago

If your cell lines have repeated events in the same locus at high cell fraction, this will cause the variants overlapping those events to incur a high rate of missing genotypes. If that is the case, you need to use a different QC strategy or a completely different missingness rate threshold. Of course, this will affect QC in a negative way in other parts of the genome but with only 64 samples you are unlikely to get a good QC anyway. My suggestion is to increase the missingness rate threshold until you get something working for you. From your last plot it seems like phasing worked okay. Given the small sample size I assume you phased including 1000 Genomes project samples as the reference panel

ryan-ed-bailey commented 9 months ago

Thanks again for your prompt reply and assistance! We will work on assessing increasing the missingness threshold and assess the downstream effects on the data.

I have 2 remaining questions:

  1. Can you clarify what you mean by "with only 64 samples you are unlikely to get a good QC anyway"? What QC metrics do you mean, specifically?
  2. [Unrelated]: can you provide a brief description of how CF estimates are generated?

Thanks again! Ryan

freeseek commented 9 months ago

With 64 samples if you have 2 missing genotypes for a given variant this will already give you a missingness rate 2/64>3% and using the recommended thresholds it will exclude that variant from further analysis. This will inevitably make you lose some good variants at sites with recurrent mCAs. As for cf, as explained here, this is a function of bdev and type:

cf = 4 bdev / (1 + 2 bdev) for losses
cf = 2 bdev for CN-LOHs
cf = 4 bdev / (1 - 2 bdev) for gains

This assuming there are enough heterozygous sites to compute the bdev statistic. Notice that at high cell fractions the heterozygous sites will start to be lost and this will inevitably bias the bdev to smaller values

ryan-ed-bailey commented 9 months ago

Thanks for the clarifications!

ryan-ed-bailey commented 9 months ago

Apologies, but one last follow-up!

Can you please define the bdev metric a bit further? Based on the README I understand the following:

bdev - BAF deviation estimate from 0.5

and

| 1/cn - 1/2 | = bdev

Despitecn is not listed in the MoChA output, explicitly, I understand theoretically how the calculation can be made based on copy number (ie. use cn = 3, assuming a 100% gain) and piped into the CF formulas above. However, if you're able to define bdev and how it's calculated specifically for a given CNV, we'd greatly appreciate it!

Thanks again,

Ryan

freeseek commented 9 months ago

The bdev is the best estimated deviation from 50% of the BAF for a given call. BAF values should be centered around 0.5 for a diploid locus but for a given event they are centered around 0.5+bdev and 0.5-bdev. That is what is triggering MoChA to make a call. After a call is made it then tries from LRR to estimate whether it is a gain, a loss, or a CN-LOH