MetH5 is an HDF5-based container format for methylation calls from long reads.
In the current version, the MetH5 format can store the following information:
Through pip:
$ pip install meth5
Through anaconda:
$ conda install -c snajder-r meth5
usage: meth5 [-h] [--chunk_size CHUNK_SIZE] {create_m5,merge_m5,annotate_reads,list_chunks,bedgraph,region_stats} .
You can create a MetH5 file with the following command, where INPUT_PATH
refers to either a nanopolish tsv output file (may or may not be gzipped) or it can be a directory which contains only said files.
$ meth5 create_m5 --input_paths INPUT_PATH1 [INPUT_PATH2 ...] --output_file OUTPUT_FILE.m5
In order to annotate reads with read grouping (for example as samples or haplotypes) you can do so by running:
$ meth annotate_reads --m5file M5FILE.m5 --read_groups_key READ_GROUPS_KEY --read_group_file READ_GROUP_FILE
Where the READ_GROUPS_KEY
is the key under which you want to store the annotation (you can store multiple read annotations),
and READ_GROUP_FILE
is a tab-delimited file containg read name and read group. For example:
read_name group
7741 f9ee-ad41-42a4-99b2-290c66960410 1
4f18b48e-a1d3-49ad-ace3-cfb96b78ad79 2
...
The MetH5 CLI supports conversion from BAM files with MM and ML tag, as produced by Guppy and Remora. This, however, depends on pysam which at this time does not support the "?" tag written by these callers.
Once issue https://github.com/pysam-developers/pysam/issues/1123 is fixed, you can convert BAM files to MetH5 files using:
$ meth5 convert -i INPUT_FILE1.bam INPUT_FILE2.bam INPUT_FILEN.bam -o OUTPUT_FILE.m5
You can also filter based on chromosomes, in case you are only interested in specific reference contigs:
$ meth5 convert -i INPUT_FILE1.bam INPUT_FILE2.bam INPUT_FILEN.bam -o OUTPUT_FILE.m5 -c chr1 chr2 chr3
The list_chunks
subcommand tells you how many chunks per chromosome are stored in the meth5 container.
Pro-tip: combined with --chunk_size 1
you can find out the exact number of methylation calls stored.
$ meth5 list_chunks -i INPUT_FILE.m5
--------- INPUT_FILE.m5 ---------
| Chromosome | Number of chunks |
| 1 | 947 |
| 10 | 497 |
| 11 | 463 |
| 12 | 462 |
| 13 | 284 |
| 14 | 305 |
| 15 | 309 |
| 16 | 484 |
| 17 | 416 |
| 18 | 245 |
| 19 | 361 |
| 2 | 782 |
| 20 | 272 |
| 21 | 187 |
| 3 | 589 |
| 4 | 532 |
| 5 | 536 |
| 6 | 532 |
| 7 | 561 |
| 8 | 467 |
| 9 | 421 |
| X | 229 |
| Y | 41 |
---------------------------------
Using the bedgraph
subcommand you can extract a bedgraph file containing either methylation rate or coverage from a specific region:
For coverage:
$ meth5 bedgraph -i INPUT_FILE.m5 -r 10:8096651-8117161 -d coverage
track type=bedGraph name=coverage description=center_label visibility=display_mode color=252,127,44 altColor=25,4,248 graphType=heatmap viewLimits=0:1 midRange=0.50:0.50 midColor=255,255,255
10 8096848 8096848 13
10 8096917 8096917 20
10 8097065 8097065 12
10 8097101 8097101 15
[...]
For methylation rate:
$ meth5 bedgraph -i INPUT_FILE.m5 -r 10:8096651-8117161 -d methylation
track type=bedGraph name=methylation description=center_label visibility=display_mode color=252,127,44 altColor=25,4,248 graphType=heatmap viewLimits=0:1 midRange=0.50:0.50 midColor=255,255,255
10 8096848 8096848 0.6153846153846154
10 8096917 8096917 1.0
10 8097065 8097065 0.4166666666666667
10 8097101 8097101 0.7333333333333333
[...]
The region_stats
subcommand allows extraction of statistics from a number of intervals specified in BED format.
The output is written as a tab-separated file with a header line
Providing the following BED file regions.bed
1 8096651 8117161
6 5997999 6007605
3 12287368 12434356
Then we can run:
$ meth5 region_stats -i INPUT_FILE.m5 -b regions.bed -d meth_rate cpgs_covered num_calls
chrom start end meth_rate cpgs_covered num_calls
1 8096651 8117161 0.6856130377221707 222 2699
6 5997999 6007605 0.22394436811259413 233 3990
3 12287368 12434356 0.6141979503555554 1103 18816
Here an example on how to access methylation values from a MetH5 file:
from meth5.meth5 import MetH5File
with MetH5File(filename, mode="r") as m:
# List chromosomes in the MetH5 file
m.get_chromosomes()
# Access chromosome 7
chr7 = m["chr7"]
# Get number of chunks
chr7.get_number_of_chunks()
# Get a container that manages the values of chunk 3
# (note that the data is not yet loaded into memory)
values = chr7.get_chunk(3)
# Get the log-likelihood ratios in the container as a numpy array of shape (n,)
llrs = values.get_llrs()
# Get the genomic start and end locations for each methylation call in the
# chunk as a numpy array of shape (n,2)
ranges = values.get_ranges()
# Compute methylation rate (beta-score of methylation) for each genomic location,
# as well as the respective coordinates
met_rates, met_rate_ranges = values.get_llr_site_rate()
# You can also compute other aggregates if you like
met_count, met_count_ranges = values.get_llr_site_aggregate(aggregation_fun=lambda llrs: (llrs>2).sum())
# Instead of accessing chunk wise, you can query a genomic range
values = chr7.get_values_in_range(36852906, 37449223)
A more detailed API documentation is in the works. Stay tuned!
In addition to accessing methylation calls in its unraveled form, the meth5
library also contains a way to represent
the methylation calls as a sparse matrix. Seeing how the values are already stored in the MetH5 file in the same way a
coordinate sparse matrix would be stored in memory, this is a very cheap operation. Example:
from meth5.meth5 import MetH5File
with MetH5File(filename, mode="r") as m:
values = m["chr7"].get_values_in_range(36852906, 37449223)
# The parameter "read_read_names" allows is to choose whether we want to load the actual
# read names into memory. It's slightly more expensive than not reading it, so only load them
# if you are interested in them
matrix = values.to_sparse_methylation_matrix(read_read_names=True)
# This is a scipy.sparse.csc_matrix matrix of dimension (r,s), containing the log-likelihood ratios of methylation
# where r is the number of reads covering the genomic range we selected, and s is the number of unique genomic
# ranges for which we have methylation calls. Since an LLR of 0 means total uncertainty, a 0 indicates no call.
matrix.met_matrix
# A numpy array of shape (s, ) containing the start position for each unique genomic range
matrix.genomic_coord
# A numpy array of shape (s, ) containing the end position for each unique genomic range
matrix.genomic_coord_end
# A numpy array of shape (r, ) containing the read names
matrix.read_names
# Get a submatrix containing only the first 10 genomic locations
submatrix = matrix.get_submatrix(0, 10)
# Get a submatrix containing only the reads in the provided list of read names
submatrix = matrix.get_submatrix_from_read_names(allowed_read_names)
A MetH5 file is an HDF5 container that stores methylation calls for long reads. The structure of the HDF5 file is as follows:
/
├─ chromosomes
│ ├─ CHROMOSOME_NAME1
│ │ ├─ llr (float dataset of shape (n,))
│ │ ├─ read_id (int dataset of shape (n,))
│ │ ├─ range (int dataset of shape (n,2))
│ │ └─ chunk_ranges (dataset of shape (c, 2))
│ │
│ ├─ CHROMOSOME_NAME2
│ │ └─ ...
│ └─ ...
└─ reads
├─ read_name_mapping (string dataset of shape (r,))
└─ read_groups
├─ READ_GROUP_KEY1 (int dataset of shape (r,))
├─ READ_GROUP_KEY2 (int dataset of shape (r,))
└─ ...
Where n
is the number of methylation calls in the respective chromosome, c
is the number of chunks, and r
is the total number of reads across all chromosomes.
If you find the MetH5 format helped you in your research, you can cite the following manuscript:
Snajder R, Leger A, Stegle O, Bonder MJ. pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls. Genome Biol. 2023;24: 83.
@ARTICLE{Snajder2023-wd,
title = "pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls",
author = "Snajder, Rene and Leger, Adrien and Stegle, Oliver and Bonder, Marc Jan",
journal = "Genome Biol.", volume = 24, number = 1, pages = "83", month = apr, year = 2023,
}