Open Madelinehazel opened 5 days ago
Madeline,
We don't have a great baseline for how many hyper- or hypo- labels we would expect for a given dataset, and I suspect it would depend on the background and the sample. With that said, you would generally expect the number of hyper/hypo labels to be much less than the uncategorized. The "uncategorized" in this context really just means it isn't an outlier with respect to the population, so most regions would fall under that label for a "normal" sample. I would be more concerned if you had a very large number of hyper/hypo relative to the uncategorized group.
Matt
Hi Matt,
That makes sense! Also, is the code for MethBat publicly available anywhere?
Thanks, Madeline
Currently, the source code is not available, which is usually how we handle code that is still under early development (i.e., in "alpha" or "beta" mode). Assuming it progresses further, we will likely make it available in the future as we have done with other recent tools, but no guarantees.
For long-read methylation in particular, the data exploration is still relatively new compared to other technologies (e.g., WGBS), and we've found that community analysis expectations are somewhat in flux. If you have any feedback or suggestions, we'd definitely welcome them!
Matt
Hi Matt,
I'm trying to correlate the MethBat output to what I see in IGV and I'm having trouble doing so. One example is the following CpG island that is apparently hypomethylated in this particular individual relative (top row in IGV screenshots) to the background (bottom coverage rows in IGV screenshots):
chr1 10638242 10640041 Unmethylated HypoMethylated ALL 1 0.0125 1.0 .04800000000000001 -3.639284825978837 0 0 100
Looking in IGV, the region doesn't particularly look differentially methylated relative to a few random samples from the background. Also, the 100 CpGs MethBat looked at in the region are apparently unphased, which I'm not sure I understand as the region is phased.
Here's an example of a hypermethylated outlier, this one with a larger z score (~8.5): chr20 34558331 34559240 Methylated HyperMethylated ALL 1 0.011764705882352941 1.0 0.8 8.460400153812067 0 0 16
Again, I don't see an obvious difference in methylation compare to the background. Should I perhaps only be considering outliers with a larger delta and larger zscore (I used the default 0.2 and 3 respectively)?
And finally - I have the following outlier, which apparently has only one supporting unphased CpG: chr1 3718069 3718537 Methylated HyperMethylated ALL 1 0.014705882352941176 1.0 .0 5.2017639941871225 0 0 1
So it seems that MethBat does not require a minimum number of CpGs in a region to call outliers. Why does it see so few CpGs in this region? Does it have a minimum supporting read depth per allele?
Thanks for taking the time to answer my questions! Madeline
Madeline,
It looks like each of these examples is a population issue. Here's the entries re-formatted with a header:
chrom start end summary_label compare_label background_category category_pop_count category_pop_freq asm_fishers_pvalue mean_hap1_methyl mean_hap2_methyl mean_meth_delta mean_abs_meth_delta_zscore mean_combined_methyl mean_combined_methyl_zscore num_phased_cpgs num_partial_cpgs num_unphased_cpgs
chr1 10638242 10640041 Unmethylated HypoMethylated ALL 1 0.0125 1.0 .04800000000000001 -3.639284825978837 0 0 100
chr20 34558331 34559240 Methylated HyperMethylated ALL 1 0.011764705882352941 1.0 0.8 8.460400153812067 0 0 16
chr1 3718069 3718537 Methylated HyperMethylated ALL 1 0.014705882352941176 1.0 .0 5.2017639941871225 0 0 1
Critically, it looks like each entry under category_pop_count
is "1", indicating that these comparisons are being done against only one sample in your background population. Without seeing your data, all I can assume is that the background you constructed is missing entries for those regions for some reason. Regardless, comparing to an N of 1 population will be statistically meaningless. Most of the issues are probably sources from this one problem.
Do you know if the background population was constructed without errors? That's another potential source of the error.
Matt
Hi Matt,
Maybe I'm misunderstanding those columns - based on the documentation, my understanding is that category_pop_count
refers to the number of datasets in the background population with the same summary_label
i.e. the number of datasets in the background that have say, a summary_label
of Unmethylated, while category_pop_freq
refers to the percentage of datasets (excluding NoData) in the background population with the same summary_label
. So for the examples above, that population frequency ranges from 0.0118 - 0.125, consistent with the summary label being seen in one sample against a background size of 92 samples (where there is some variation depending on the number of datasets that have data at this locus.
Am I missing something here?
Thanks, Madeline
Madeline,
Am I missing something here?
Sorry, you're correct. It's been a while since I've looked directly at those files, so it was my mistake on the outputs.
Getting back to your examples, I agree that the IGV visuals do not match the entries that you shared, but I'm not entirely sure why with the given information. Here's the key issues I see with each:
Unmethylated
, which possibly supports the first half of the region, but the second half I would expect to counter-balance it into a Uncategorized
call. This alone makes me think we're missing something in the processing, regardless of the population status.Methylated
but looked Unmethylated
for the most part. However, we can see that only 16 unphased CpGs are being used to describe the sample. If those 16 are on the far right, this label would make sense. It seems like there is some heavy filtering happening at this locus. Do you have a read-count filter applied in pb-CpG-tools?There's a couple things I think we could do here that might clarify what's happening in your data:
--min-coverage
option in pb-CpG-tools that is 4 by default. MethBat should not be filtering anything, but if you can provide the CLI options for both that will be helpful to rule out usage as an issue. If I had to hazard a guess, I would bet that you have --min-coverage 10
set. If that's not the case then...Hopefully this helps, sorry again for the earlier mixup on my end! Matt
No worries Matt, I could have made it more clear by copying in the column header initially! Thank you for the suggestions, I'll look into these and come back with any questions I have.
Madeline
Hello,
Thank you for developing this tool, it was very straightforward to install and run! I've run a rare methylation analysis by running
methbat profile
on 92 Revio HiFi datasets against the provided hg38 CpG island file, and then runningmethbat build
to generate a cohort methylation profile. Then, for each of my 92 datasets, I ranmethbat profile
with the cohort methylation profile as the--input-regions
argument to identify rare methylation states in each dataset.However, the vast majority of the
compare_label
values for each dataset are "Uncategorized". An example of the number of each label in one sample is below:Uncategorized 27559 InsufficientData 4469 HyperASM 6 HyperMethylated 3 HypoMethylated 1
Is this expected? I'm using the latest release (0.13.2).
Thanks, Madeline