PacificBiosciences / MethBat

A battery of methylation tools for PacBio HiFi reads
BSD 3-Clause Clear License
28 stars 1 forks source link

Rare methylation outliers #16

Open Madelinehazel opened 5 days ago

Madelinehazel commented 5 days ago

Hello,

Thank you for developing this tool, it was very straightforward to install and run! I've run a rare methylation analysis by running methbat profile on 92 Revio HiFi datasets against the provided hg38 CpG island file, and then running methbat build to generate a cohort methylation profile. Then, for each of my 92 datasets, I ran methbat profile with the cohort methylation profile as the --input-regions argument to identify rare methylation states in each dataset.

However, the vast majority of the compare_label values for each dataset are "Uncategorized". An example of the number of each label in one sample is below:

Uncategorized 27559 InsufficientData 4469 HyperASM 6 HyperMethylated 3 HypoMethylated 1

Is this expected? I'm using the latest release (0.13.2).

Thanks, Madeline

holtjma commented 5 days ago

Madeline,

We don't have a great baseline for how many hyper- or hypo- labels we would expect for a given dataset, and I suspect it would depend on the background and the sample. With that said, you would generally expect the number of hyper/hypo labels to be much less than the uncategorized. The "uncategorized" in this context really just means it isn't an outlier with respect to the population, so most regions would fall under that label for a "normal" sample. I would be more concerned if you had a very large number of hyper/hypo relative to the uncategorized group.

Matt

Madelinehazel commented 4 days ago

Hi Matt,

That makes sense! Also, is the code for MethBat publicly available anywhere?

Thanks, Madeline

holtjma commented 4 days ago

Currently, the source code is not available, which is usually how we handle code that is still under early development (i.e., in "alpha" or "beta" mode). Assuming it progresses further, we will likely make it available in the future as we have done with other recent tools, but no guarantees.

For long-read methylation in particular, the data exploration is still relatively new compared to other technologies (e.g., WGBS), and we've found that community analysis expectations are somewhat in flux. If you have any feedback or suggestions, we'd definitely welcome them!

Matt

Madelinehazel commented 3 days ago

Hi Matt,

I'm trying to correlate the MethBat output to what I see in IGV and I'm having trouble doing so. One example is the following CpG island that is apparently hypomethylated in this particular individual relative (top row in IGV screenshots) to the background (bottom coverage rows in IGV screenshots):

chr1 10638242 10640041 Unmethylated HypoMethylated ALL 1 0.0125 1.0 .04800000000000001 -3.639284825978837 0 0 100

Looking in IGV, the region doesn't particularly look differentially methylated relative to a few random samples from the background. Also, the 100 CpGs MethBat looked at in the region are apparently unphased, which I'm not sure I understand as the region is phased.

CASZ1

Here's an example of a hypermethylated outlier, this one with a larger z score (~8.5): chr20 34558331 34559240 Methylated HyperMethylated ALL 1 0.011764705882352941 1.0 0.8 8.460400153812067 0 0 16

MAP1LC3A

Again, I don't see an obvious difference in methylation compare to the background. Should I perhaps only be considering outliers with a larger delta and larger zscore (I used the default 0.2 and 3 respectively)?

And finally - I have the following outlier, which apparently has only one supporting unphased CpG: chr1 3718069 3718537 Methylated HyperMethylated ALL 1 0.014705882352941176 1.0 .0 5.2017639941871225 0 0 1

TP73

So it seems that MethBat does not require a minimum number of CpGs in a region to call outliers. Why does it see so few CpGs in this region? Does it have a minimum supporting read depth per allele?

Thanks for taking the time to answer my questions! Madeline

holtjma commented 3 days ago

Madeline,

It looks like each of these examples is a population issue. Here's the entries re-formatted with a header:

chrom   start   end summary_label   compare_label   background_category category_pop_count  category_pop_freq   asm_fishers_pvalue  mean_hap1_methyl    mean_hap2_methyl    mean_meth_delta mean_abs_meth_delta_zscore  mean_combined_methyl    mean_combined_methyl_zscore num_phased_cpgs num_partial_cpgs    num_unphased_cpgs
chr1 10638242 10640041 Unmethylated HypoMethylated ALL 1 0.0125 1.0 .04800000000000001 -3.639284825978837 0 0 100
chr20 34558331 34559240 Methylated HyperMethylated ALL 1 0.011764705882352941 1.0 0.8 8.460400153812067 0 0 16
chr1 3718069 3718537 Methylated HyperMethylated ALL 1 0.014705882352941176 1.0 .0 5.2017639941871225 0 0 1

Critically, it looks like each entry under category_pop_count is "1", indicating that these comparisons are being done against only one sample in your background population. Without seeing your data, all I can assume is that the background you constructed is missing entries for those regions for some reason. Regardless, comparing to an N of 1 population will be statistically meaningless. Most of the issues are probably sources from this one problem.

Do you know if the background population was constructed without errors? That's another potential source of the error.

Matt

Madelinehazel commented 3 days ago

Hi Matt,

Maybe I'm misunderstanding those columns - based on the documentation, my understanding is that category_pop_count refers to the number of datasets in the background population with the same summary_label i.e. the number of datasets in the background that have say, a summary_label of Unmethylated, while category_pop_freq refers to the percentage of datasets (excluding NoData) in the background population with the same summary_label. So for the examples above, that population frequency ranges from 0.0118 - 0.125, consistent with the summary label being seen in one sample against a background size of 92 samples (where there is some variation depending on the number of datasets that have data at this locus.

Am I missing something here?

Thanks, Madeline

holtjma commented 2 days ago

Madeline,

Am I missing something here?

Sorry, you're correct. It's been a while since I've looked directly at those files, so it was my mistake on the outputs.

Getting back to your examples, I agree that the IGV visuals do not match the entries that you shared, but I'm not entirely sure why with the given information. Here's the key issues I see with each:

  1. First image is labeled as Unmethylated, which possibly supports the first half of the region, but the second half I would expect to counter-balance it into a Uncategorized call. This alone makes me think we're missing something in the processing, regardless of the population status.
  2. Second image is labeled as Methylated but looked Unmethylated for the most part. However, we can see that only 16 unphased CpGs are being used to describe the sample. If those 16 are on the far right, this label would make sense. It seems like there is some heavy filtering happening at this locus. Do you have a read-count filter applied in pb-CpG-tools?
  3. Third image is a similar story to the second image. Based on the coverage of both, I do think there is a filter occurring here that's removing a lot of your CpGs. Regardless, I usually filter anything from my own analysis with less than 10 CpGs.

There's a couple things I think we could do here that might clarify what's happening in your data:

  1. Check the CLI options - There is a --min-coverage option in pb-CpG-tools that is 4 by default. MethBat should not be filtering anything, but if you can provide the CLI options for both that will be helpful to rule out usage as an issue. If I had to hazard a guess, I would bet that you have --min-coverage 10 set. If that's not the case then...
  2. Check the database entries - For each of the above, there should be a database entry with this format: https://github.com/PacificBiosciences/MethBat/blob/main/docs/profile_guide.md#cohortbackground-profiles. We can at least check that those entries make sense.
  3. Check the pb-CpG-tools outputs - The pb-CpG-tools outputs will give us a direct line to any filtering happening upstream of MethBat. We may need to look at those entries to see if they were filtered upstream. If you can provide the database entry and set of CpGs in that region for one of these images, we may be able to figure out what's happening.

Hopefully this helps, sorry again for the earlier mixup on my end! Matt

Madelinehazel commented 2 days ago

No worries Matt, I could have made it more clear by copying in the column header initially! Thank you for the suggestions, I'll look into these and come back with any questions I have.

Madeline