PMBio / MetH5Format

HDF5-based container format for Methylation calls from long reads
MIT License
7 stars 2 forks source link

MetH5Format 1.2.0

GitHub license DOI Language Build Status Code style: black

PyPI version PyPI downloads Anaconda Version Anaconda Downloads

MetH5 is an HDF5-based container format for methylation calls from long reads.

In the current version, the MetH5 format can store the following information:

Installation

Through pip:

 $ pip install meth5

Through anaconda:

 $ conda install -c snajder-r meth5

Usage

usage: meth5 [-h] [--chunk_size CHUNK_SIZE] {create_m5,merge_m5,annotate_reads,list_chunks,bedgraph,region_stats} .

Creating a MetH5 file from nanopolish methylation calls

You can create a MetH5 file with the following command, where INPUT_PATH refers to either a nanopolish tsv output file (may or may not be gzipped) or it can be a directory which contains only said files.

$ meth5 create_m5 --input_paths INPUT_PATH1 [INPUT_PATH2 ...] --output_file OUTPUT_FILE.m5

In order to annotate reads with read grouping (for example as samples or haplotypes) you can do so by running:

$ meth annotate_reads --m5file M5FILE.m5 --read_groups_key READ_GROUPS_KEY --read_group_file READ_GROUP_FILE

Where the READ_GROUPS_KEY is the key under which you want to store the annotation (you can store multiple read annotations), and READ_GROUP_FILE is a tab-delimited file containg read name and read group. For example:

read_name   group
7741 f9ee-ad41-42a4-99b2-290c66960410    1
4f18b48e-a1d3-49ad-ace3-cfb96b78ad79    2
...

Experimental: Converting a BAM file with MM and ML tag to MetH5

The MetH5 CLI supports conversion from BAM files with MM and ML tag, as produced by Guppy and Remora. This, however, depends on pysam which at this time does not support the "?" tag written by these callers.

Once issue https://github.com/pysam-developers/pysam/issues/1123 is fixed, you can convert BAM files to MetH5 files using:

$ meth5 convert -i INPUT_FILE1.bam INPUT_FILE2.bam INPUT_FILEN.bam -o OUTPUT_FILE.m5

You can also filter based on chromosomes, in case you are only interested in specific reference contigs:

$ meth5 convert -i INPUT_FILE1.bam INPUT_FILE2.bam INPUT_FILEN.bam -o OUTPUT_FILE.m5 -c chr1 chr2 chr3

Summarize contents of meth5 file

The list_chunks subcommand tells you how many chunks per chromosome are stored in the meth5 container. Pro-tip: combined with --chunk_size 1 you can find out the exact number of methylation calls stored.

$ meth5 list_chunks -i INPUT_FILE.m5
--------- INPUT_FILE.m5 ---------
| Chromosome | Number of chunks |
| 1          | 947              |
| 10         | 497              |
| 11         | 463              |
| 12         | 462              |
| 13         | 284              |
| 14         | 305              |
| 15         | 309              |
| 16         | 484              |
| 17         | 416              |
| 18         | 245              |
| 19         | 361              |
| 2          | 782              |
| 20         | 272              |
| 21         | 187              |
| 3          | 589              |
| 4          | 532              |
| 5          | 536              |
| 6          | 532              |
| 7          | 561              |
| 8          | 467              |
| 9          | 421              |
| X          | 229              |
| Y          | 41               |
---------------------------------

Extracting bedgraph for visualization in IGV

Using the bedgraph subcommand you can extract a bedgraph file containing either methylation rate or coverage from a specific region:

For coverage:

$ meth5 bedgraph -i INPUT_FILE.m5 -r 10:8096651-8117161 -d coverage
track type=bedGraph name=coverage description=center_label visibility=display_mode color=252,127,44 altColor=25,4,248 graphType=heatmap viewLimits=0:1 midRange=0.50:0.50 midColor=255,255,255
10 8096848 8096848 13
10 8096917 8096917 20
10 8097065 8097065 12
10 8097101 8097101 15
[...]

For methylation rate:

$ meth5 bedgraph -i INPUT_FILE.m5 -r 10:8096651-8117161 -d methylation
track type=bedGraph name=methylation description=center_label visibility=display_mode color=252,127,44 altColor=25,4,248 graphType=heatmap viewLimits=0:1 midRange=0.50:0.50 midColor=255,255,255
10 8096848 8096848 0.6153846153846154
10 8096917 8096917 1.0
10 8097065 8097065 0.4166666666666667
10 8097101 8097101 0.7333333333333333
[...]

Summarizing intervals from a BED file

The region_stats subcommand allows extraction of statistics from a number of intervals specified in BED format. The output is written as a tab-separated file with a header line

Providing the following BED file regions.bed

1   8096651     8117161
6   5997999     6007605
3   12287368    12434356

Then we can run:

$ meth5 region_stats -i INPUT_FILE.m5 -b regions.bed -d meth_rate cpgs_covered num_calls
chrom   start       end         meth_rate           cpgs_covered    num_calls
1       8096651     8117161     0.6856130377221707  222             2699
6       5997999     6007605     0.22394436811259413 233             3990
3       12287368    12434356    0.6141979503555554  1103            18816

Quick start for python API

Here an example on how to access methylation values from a MetH5 file:

from meth5.meth5 import MetH5File

with MetH5File(filename, mode="r") as m:
    # List chromosomes in the MetH5 file
    m.get_chromosomes()

    # Access chromosome 7
    chr7 = m["chr7"]

    # Get number of chunks
    chr7.get_number_of_chunks()

    # Get a container that manages the values of chunk 3
    # (note that the data is not yet loaded into memory)
    values = chr7.get_chunk(3)

    # Get the log-likelihood ratios in the container as a numpy array of shape (n,)
    llrs = values.get_llrs()

    # Get the genomic start and end locations for each methylation call in the 
    # chunk as a numpy array  of shape (n,2) 
    ranges = values.get_ranges()

    # Compute methylation rate (beta-score of methylation) for each genomic location,
    # as well as the respective coordinates
    met_rates, met_rate_ranges = values.get_llr_site_rate()

    # You can also compute other aggregates if you like
    met_count, met_count_ranges = values.get_llr_site_aggregate(aggregation_fun=lambda llrs: (llrs>2).sum())

    # Instead of accessing chunk wise, you can query a genomic range
    values = chr7.get_values_in_range(36852906, 37449223)

A more detailed API documentation is in the works. Stay tuned!

Sparse methylation matrix

In addition to accessing methylation calls in its unraveled form, the meth5 library also contains a way to represent the methylation calls as a sparse matrix. Seeing how the values are already stored in the MetH5 file in the same way a coordinate sparse matrix would be stored in memory, this is a very cheap operation. Example:

from meth5.meth5 import MetH5File

with MetH5File(filename, mode="r") as m:
    values = m["chr7"].get_values_in_range(36852906, 37449223)

    # The parameter "read_read_names" allows is to choose whether we want to load the actual
    # read names into memory. It's slightly more expensive than not reading it, so only load them
    # if you are interested in them
    matrix = values.to_sparse_methylation_matrix(read_read_names=True)

    # This is a scipy.sparse.csc_matrix matrix of dimension (r,s), containing the log-likelihood ratios of methylation
    # where r is the number of reads covering the genomic range we selected, and s is the number of unique genomic 
    # ranges for which we have methylation calls. Since an LLR of 0 means total uncertainty, a 0 indicates no call.
    matrix.met_matrix

    # A numpy array of shape (s, ) containing the start position for each unique genomic range
    matrix.genomic_coord
    # A numpy array of shape (s, ) containing the end position for each unique genomic range
    matrix.genomic_coord_end

    # A numpy array of shape (r, ) containing the read names
    matrix.read_names

    # Get a submatrix containing only the first 10 genomic locations
    submatrix = matrix.get_submatrix(0, 10)

    # Get a submatrix containing only the reads in the provided list of read names
    submatrix = matrix.get_submatrix_from_read_names(allowed_read_names)

The MetH5 Format

A MetH5 file is an HDF5 container that stores methylation calls for long reads. The structure of the HDF5 file is as follows:

/
├─ chromosomes
│  ├─ CHROMOSOME_NAME1
│  │  ├─ llr (float dataset of shape (n,))
│  │  ├─ read_id (int dataset of shape (n,))
│  │  ├─ range (int dataset of shape (n,2))
│  │  └─ chunk_ranges (dataset of shape (c, 2))
│  │   
│  ├─ CHROMOSOME_NAME2
│  │  └─ ...
│  └─ ...
└─ reads
   ├─ read_name_mapping (string dataset of shape (r,))
   └─ read_groups
      ├─ READ_GROUP_KEY1 (int dataset of shape (r,))
      ├─ READ_GROUP_KEY2 (int dataset of shape (r,))
      └─ ... 

Where n is the number of methylation calls in the respective chromosome, c is the number of chunks, and ris the total number of reads across all chromosomes.


Citing

If you find the MetH5 format helped you in your research, you can cite the following manuscript:

Snajder R, Leger A, Stegle O, Bonder MJ. pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls. Genome Biol. 2023;24: 83.

@ARTICLE{Snajder2023-wd,
  title    = "pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls",
  author   = "Snajder, Rene and Leger, Adrien and Stegle, Oliver and Bonder, Marc Jan",
  journal  = "Genome Biol.", volume = 24, number = 1, pages = "83", month = apr, year = 2023,
}

Authors and contributors