PacificBiosciences / MethBat

A battery of methylation tools for PacBio HiFi reads
BSD 3-Clause Clear License
24 stars 0 forks source link

About metagenomic data #8

Closed jxmelody closed 3 months ago

jxmelody commented 3 months ago

Hi,

Thank you for implementing this great tool.

I am trying to use this tool to get methylation profile for my metagenomic data, but I got the error:

[2024-05-26T17:07:43.378Z ERROR methbat] Error while verifying settings: hap1_bed does not exist: "ref_pbmm2_6G_len_1000.hap1.bed"

pb-CpG-tool does not generate hap1 adn hap2 file as it is metagenomic data. But I still wonder can we use this tool to ge methylation profile?

Best, Jin

holtjma commented 3 months ago

Hi Jin,

In its current state, it is expecting to find combined.bed, hap1.bed, and hap2.bed. However, this is clearly a situation where it could generate combined information without any haplotype specific outputs. I'll make a note to see if we can create an option to disable this requirement (and the corresponding phased outputs). It may be a while as I'm currently on FTO.

In the interim, if you want to try to progress with the existing MethBat version, you should be able to just mock the files. I think if you create an empty file with the corresponding filenames (one for hap1, one for hap2), it will recognize it as an empty dataset and continue past the above error.

Matt

jxmelody commented 3 months ago

Hi Matt,

Thank you for your detailed response. As you suggested, I tried to create these two empty hap.bed files. Now I get a different error message, which I guess is about the region file I provided?

$ methbat profile --input-prefix ref_pbmm2_6G_len_1000 --input-regions ref_pbmm2_6G_len_1000_methbat_segment.csv --output-region-profile metabat_profile
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile] Input prefix: "ref_pbmm2_6G_len_1000"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile] Region profiling:
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Input profile regions: "ref_pbmm2_6G_len_1000_methbat_segment.csv"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Output profile file: "metabat_profile"
[2024-06-02T11:55:07.223Z INFO  methbat::cli::profile]  Profile selections: ["ALL"]
[2024-06-02T11:55:07.223Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.combined.bed"...
[2024-06-02T11:55:07.223Z INFO  methbat::cpg_parser] Model mode auto-detected
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.hap1.bed"...
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000.hap2.bed"...
[2024-06-02T11:55:08.573Z INFO  methbat::cpg_parser] Loading "ref_pbmm2_6G_len_1000_methbat_segment.csv"...
[2024-06-02T11:55:08.574Z ERROR methbat] Error while aggregating region statistics: CSV deserialize error: record 1 (line: 2, byte: 30): missing field `chrom`

The region file is generated by

$methbat segment --input-prefix ref_pbmm2_6G_len_1000 --output-segments  ref_pbmm2_6G_len_1000_methbat_segment.csv

This file looks like this:

#chrom  start   end summary_label
581f4cc2897b459a_1  21  4898301 Unmethylated
581f4cc2897b459a_2  22  102511  Unmethylated
581f4cc2897b459a_3  26  5333    Unmethylated
MG1655  21  1299415 Unmethylated
MG1655  1300836 2955015 Unmethylated
MG1655  2955288 4641590 Unmethylated
ATCC_43504  37  148267  Unmethylated
...

Do you have any suggestions?

Also, I have a slight concern about the generated region result. I found the summary label column to be all Unmethylated. Does this suggest something abnormal in my sequencing data or possible data processing mistakes?

Many thanks.

Jin

holtjma commented 3 months ago

Jin,

To be clear, the output of methbat segment is not meant to be sent to methbat profile; those are two separate analysis paths (which is why you're getting an error). MethBat's segmentation algorithm will provide any regions it detects as methylated, unmethylated, or ASM (not applicable here); and then you will need to analyze those regions for your application.

Also, I have a slight concern about the generated region result. I found the summary label column to be all Unmethylated. Does this suggest something abnormal in my sequencing data or possible data processing mistakes?

In human data, we usually see a variety of states, not just unmethylated. I'm not sure what to expect in metagenomics, but I assume at least some would be methylated.

@dportik Do you have any expectations around what we would see if we ran a methylation segmentation algorithm on metagenomic data?

Matt

holtjma commented 3 months ago

@jxmelody,

I spoke with Dan and will relay the key parts of that conversation. The short version is we don't have data on the accuracy of methylation in metagenomics, so we don't really have a clear set of expectations to convey here. We have not performed any validations on metagenomic datasets, so you're definitely in an experimental area just with the baseline 5mC calls. pb-CpG-tools (and MethBat as a consequence) were also designed with diploid organisms in mind, so there are likely some assumptions that will not match up correctly to metagenomics context.

Sorry this probably isn't a satisfactory answer, but let me know if you have any follow up questions!

Matt

jxmelody commented 3 months ago

Dear Matt,

Thank you for your response! Your explanation is very helpful.

I would like to further clarify the difference between using metagenomic data and diploid data for methylation analysis. Is this difference related to the process of obtaining the methylation profile, which typically requires negative control data? I understand that in many cases, whole genome amplification (WGA) is used to generate unmethylated DNA as the control data. And due to a real control data is not always avaliable, many tools use in silico control to do prediction.

Considering this, even though I have HiFi data with kinetic information and MM/ML tags indicating the 5-methylcytosine (5mC) location and score (generated by the SMRT Link tool) in the BAM file, would I still need additional control data to accurately determine the methylation locations? Or do I need to redo the analysis(I think inside the SMRTlink, jasmine is used for 5mC methylation detection)

I also came across a tool called ipdSummary, which is part of the KineticTools suite provided by PacBio. This tool supports the detection of 6mA and 4mC modifications. However, according to an issue I found, ipdSummary may not correctly detect 5mC because it uses metagenomic data for training(from this issue).

Based on this information, my understanding is that the pipeline consisting of Jasmine, pb-CpG-tool, and MethBat is primarily designed for human genome analysis, which predominantly involves 5mC methylation. In contrast, metagenomic data often exhibits a higher prevalence of 6mA and 4mC modifications. This difference in the predominant methylation types could potentially explain why I obtained all unmethylated labels when using MethBat segmentation on my metagenomic data.

Please let me know if my understanding is correct or if there are any additional factors I should consider.

Many, many thanks!

Jin

holtjma commented 3 months ago

Jin,

I would like to further clarify the difference between using metagenomic data and diploid data for methylation analysis. Is this difference related to the process of obtaining the methylation profile, which typically requires negative control data? I understand that in many cases, whole genome amplification (WGA) is used to generate unmethylated DNA as the control data. And due to a real control data is not always avaliable, many tools use in silico control to do prediction.

My understanding is that this is how the models for methylation were trained for human datasets.

Considering this, even though I have HiFi data with kinetic information and MM/ML tags indicating the 5-methylcytosine (5mC) location and score (generated by the SMRT Link tool) in the BAM file, would I still need additional control data to accurately determine the methylation locations? Or do I need to redo the analysis(I think inside the SMRTlink, jasmine is used for 5mC methylation detection)

My apologies, but this isn't really something I can speak to. Thus far, I have only used our methylation tooling for human datasets. Given that your scope of questions is growing, I would recommend contacting your PacBio representative to get more information on how methylation might work with metagenomic datasets. What I can say for sure is that MethBat is not designed to work with non-diploid organisms.

I also came across a tool called ipdSummary, which is part of the KineticTools suite provided by PacBio. This tool supports the detection of 6mA and 4mC modifications. However, according to an https://github.com/PacificBiosciences/kineticsTools/issues/97 I found, ipdSummary may not correctly detect 5mC because it uses metagenomic data for training(from https://github.com/PacificBiosciences/kineticsTools/issues/87).

I'm unfamiliar with this tooling, so I cannot speak to its performance. You would likely get more information by opening an issue there or through your PacBio representative.

Based on this information, my understanding is that the pipeline consisting of Jasmine, pb-CpG-tool, and MethBat is primarily designed for human genome analysis, which predominantly involves 5mC methylation. In contrast, metagenomic data often exhibits a higher prevalence of 6mA and 4mC modifications. This difference in the predominant methylation types could potentially explain why I obtained all unmethylated labels when using MethBat segmentation on my metagenomic data.

I can't speak to the limits of Jasmine, but pb-CpG-tools and MethBat are both intrinsically tied to 5mC and diploid analysis. Anything outside that scope might work, but we are not building the tooling with it in mind.

Sorry there wasn't much I could really answer there, a lot of your questions are beyond the scope of MethBat's functionality at this time.

Matt

jxmelody commented 3 months ago

Matt,

Thank you so much for your kind support. It really helps :)

Jin