PMBio / MetH5Format

HDF5-based container format for Methylation calls from long reads
MIT License

Increase support for importing and converting data in multiple formats #3

Open starsyi opened 1 year ago

starsyi commented 1 year ago

Would it also be possible to support importing other formats, such as the BED and TSV result files that modkit outputs with methylation modification information?

Since nanopolish does not support the latest R10.4 chemistry, and dorado/remora is now the standard way to obtain nanopore methylation calls, it would be great to be able to use meth5 and pycometh with modBAMs generated by remora.

snajder-r commented 1 year ago

The nanopolish output that is currently supported is a TSV file. If you can format your TSV file to contain the required columns (see below) you can import it as if it were a nanopolish output:

    chromosome    start        end          read_name      log_lik_ratio
    chr1          30012312     30012312     aksdlaksdlas   -4.542

The order of the columns does not matter, as long as you have a single header line with these column names. Personally I don't have the capacity right now to implement explicit conversion commands for modkit or other tools, but I'll leave the issue open and will be happy to accept pull requests that come with test data.
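
The layout described above can be produced with Python's standard library alone. This is a minimal sketch, not part of MetH5Format: the function name `write_nanopolish_like_tsv` and the example record are hypothetical, and only the five column names are taken from the comment above.

```python
import csv
import io

# Column names required by the nanopolish-style import, as listed in the thread.
REQUIRED_COLUMNS = ["chromosome", "start", "end", "read_name", "log_lik_ratio"]


def write_nanopolish_like_tsv(records, handle):
    """Write per-read calls as a tab-separated file with a single header line.

    `records` is an iterable of dicts keyed by the required column names
    (a hypothetical in-memory representation chosen for this sketch).
    """
    writer = csv.DictWriter(
        handle, fieldnames=REQUIRED_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        missing = [col for col in REQUIRED_COLUMNS if col not in rec]
        if missing:
            raise ValueError(f"record is missing columns: {missing}")
        writer.writerow(rec)


# Example with the row shown in the thread, written to an in-memory buffer.
buf = io.StringIO()
write_nanopolish_like_tsv(
    [
        {
            "chromosome": "chr1",
            "start": 30012312,
            "end": 30012312,
            "read_name": "aksdlaksdlas",
            "log_lik_ratio": -4.542,
        }
    ],
    buf,
)
tsv_text = buf.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.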

PanZiwei commented 10 months ago

Nanopolish calls 5mCs with a log-likelihood ratio and applies a specific cutoff for methylation calling, but other tools like DeepSignal or Guppy predict a methylation probability for each site instead, and as far as I know these two values can't be converted into each other. How can this be solved?

In this case, is the log_lik_ratio column necessary for the conversion? Does the column support methylation probability? How does it contribute to the meth5 conversion? Thanks!

snajder-r commented 10 months ago

The column is required, so you'll need to convert from methylation probability (range 0 to 1) to log-likelihood ratio (range negative infinity to positive infinity).

Assuming an uninformative prior, use the logit function to convert:

log_lik_ratio = logit(p) = ln(p/(1-p))
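
The logit conversion above could be sketched in Python as follows. The function name `prob_to_llr` and the clipping parameter `eps` are assumptions added for this example: the formula is undefined at exactly p = 0 or p = 1, so the sketch clips probabilities slightly inside the open interval, and the chosen `eps` is arbitrary.

```python
import math


def prob_to_llr(p, eps=1e-6):
    """Convert a methylation probability to a log-likelihood ratio via logit.

    Uses the natural logarithm, as in the formula above. Probabilities are
    clipped to [eps, 1 - eps] (an assumption of this sketch) so that p = 0
    and p = 1 give large finite values instead of a math error.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# A probability of 0.5 carries no evidence either way and maps to 0.
llr_half = prob_to_llr(0.5)
# Probabilities above 0.5 give positive ratios, below 0.5 negative ones.
llr_high = prob_to_llr(0.9)
llr_low = prob_to_llr(0.1)
```

The resulting values can be stored in the `log_lik_ratio` column in place of the per-site probability.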