frejanordsiek / hdf5storage

Python package to read and write a wide range of Python types to/from HDF5 formatted files. Can read/write data to the HDF5 based Matlab v7.3 MAT files.
BSD 2-Clause "Simplified" License

Overview

This Python package provides high-level utilities to read/write a variety of Python types to/from HDF5 (Hierarchical Data Format) formatted files. This package also provides support for MATLAB MAT v7.3 formatted files, which are just HDF5 files with a different extension and some extra meta-data.

All of this is done without pickling data. Pickling is bad for security because it allows arbitrary code to be executed in the interpreter. One wants to be able to read HDF5 and MAT files from possibly untrusted sources, so pickling is avoided in this package.
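The security concern can be seen with the standard library alone. The following is a minimal sketch, not code from this package; the ``Payload`` class is hypothetical and uses the harmless callable ``int`` where a hostile file could name any importable callable.

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form asks the loader to call int("42")."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it when loading.
        # A hostile file could name any importable callable here instead of int.
        return (int, ("42",))


blob = pickle.dumps(Payload())

# Loading calls int("42"); no Payload instance is rebuilt.
result = pickle.loads(blob)

print(type(result).__name__, result)
```

Merely loading the bytes runs the callable chosen by whoever produced them, which is why a format that only stores data is preferable for files from untrusted sources.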

The package's documentation is found at http://pythonhosted.org/hdf5storage/

The package's source code is found at https://github.com/frejanordsiek/hdf5storage

The package is licensed under a 2-clause BSD license (https://github.com/frejanordsiek/hdf5storage/blob/master/COPYING.txt).

Installation

Dependencies

This package only supports Python >= 3.7. Python < 3.7 support was dropped in version 0.2.

This package requires the python packages to run

Note that support for h5py <https://pypi.org/project/h5py>_ 2.1 to 3.2.x has been dropped in version 0.2. This package also has the following optional dependencies

Installing by pip

This package is on PyPI <https://pypi.org> at hdf5storage <https://pypi.org/project/hdf5storage>. To install hdf5storage using pip, run the command::

pip install hdf5storage

Installing from Source

To install hdf5storage from source, setuptools <https://pypi.org/project/setuptools>_ >= 61.0.0 is required. Download this package and then install the dependencies ::

pip install -r requirements.txt

Then to install the package, run ::

pip install .

Running Tests

For testing, the package pytest <https://pypi.org/project/pytest> (>= 6.0) is additionally required. Some tests require scipy <https://pypi.org/project/scipy> to be installed, and some require Matlab to be in the executable path. In addition, there are some tests that require Julia <http://julialang.org/> with the MAT <https://github.com/simonster/MAT.jl> package. If any of these are missing, the tests that need them are skipped but all the other tests will still run. To install all testing dependencies other than scipy <https://pypi.org/project/scipy>_, Julia, and Matlab, run ::

pip install -r requirements_tests.txt

To run the tests ::

pytest

Building Documentation

The documentation additionally requires the following packages

The documentation dependencies can be installed by ::

pip install -r requirements_doc.txt

To build the HTML documentation, run ::

sphinx-build doc/source doc/build/html

Development

All Python code is formatted using black <https://pypi.org/project/black>_. Releases and Pull Requests should pass all unit tests, and ideally pass type checking and have no warnings found by linting.

Type Checking

This package has had type annotations since version 0.2, which can be checked with a type checker like mypy <https://pypi.org/project/mypy>. To check with mypy <https://pypi.org/project/mypy>, run ::

mypy -p hdf5storage

Linting

This package has the configuration in pyproject.toml for linting with

To lint with ruff <https://pypi.org/project/ruff>_, run ::

ruff .

To lint with pylint <https://pypi.org/project/pylint>_, run ::

pylint src/*/*.py

Python 2

This package no longer supports Python 2.6 and 2.7. This package was designed and written for Python 3, then backported to Python 2.x, and then Python 2.x support was dropped. It can still read files made by version 0.1.x of this library with Python 2.x, and it still tries to write files compatible with 0.1.x when possible.

Hierarchical Data Format 5 (HDF5)

HDF5 (see http://www.hdfgroup.org/HDF5/) is a commonly used file format for the exchange of numerical data. It has built-in support for a large variety of number formats (un/signed integers, floating point numbers, strings, etc.) as scalars and arrays, enums and compound types. It also handles differences in data representation on different hardware platforms (endianness, different floating point formats, etc.). As can be imagined from the name, data is represented in an HDF5 file in a hierarchical form modelling a Unix filesystem (Datasets are equivalent to files, Groups are equivalent to directories, and links are supported).

This package interfaces HDF5 files using the h5py package (http://www.h5py.org/) as opposed to the PyTables package (http://www.pytables.org/).
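The filesystem analogy can be sketched directly with h5py. This example does not use hdf5storage; the group, dataset, and link names are made up for illustration, and the file is kept in memory with the ``core`` driver so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; backing_store=False means no file is created on disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # A Group behaves like a directory.
    grp = f.create_group("measurements")
    # A Dataset behaves like a file inside that directory.
    grp.create_dataset("voltage", data=np.array([1.0, 2.5, 4.0]))
    # A soft link behaves like a symbolic link to another path.
    f["latest"] = h5py.SoftLink("/measurements/voltage")

    top_level = sorted(f.keys())
    values = f["/measurements/voltage"][...]
    via_link = f["latest"][...]

print(top_level)
print(values, via_link)
```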

MATLAB MAT v7.3 file support

MATLAB (http://www.mathworks.com/) MAT files version 7.3 and later are HDF5 files with a different file extension (.mat) and a very specific set of meta-data and storage conventions. This package provides read and write support for a limited set of Python and MATLAB types.

SciPy (http://scipy.org/) has functions to read and write the older MAT file formats. This package has functions modeled after the scipy.io.savemat and scipy.io.loadmat functions, with the same names and similar arguments. They dispatch to the SciPy versions if the MAT file format is not an HDF5 based one.

Supported Types

The supported Python and MATLAB types are given in the tables below. The tables assume that one has imported collections and numpy as::

import collections as cl
import numpy as np

The table gives which Python types can be read and written, the first version of this package to support it, the numpy type it gets converted to for storage (if type information is not written, that will be what it is read back as), the MATLAB class it becomes if targeting a MAT file, and the first version of this package to support writing it so MATLAB can read it.

+--------------------+---------+-------------------------+-------------+---------+-------------------+
| Python                                                 | MATLAB                | Notes             |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| Type               | Version | Converted to            | Class       | Version |                   |
+====================+=========+=========================+=============+=========+===================+
| bool               | 0.1     | np.bool_ or np.uint8    | logical     | 0.1     | [1]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| None               | 0.1     | np.float64([])          | []          | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| Ellipsis           | 0.2     | np.float64([])          | []          | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| NotImplemented     | 0.2     | np.float64([])          | []          | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| int                | 0.1     | np.int64 or np.bytes_   | int64       | 0.1     | [2] [3]           |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| long               | 0.1     | np.int64 or np.bytes_   | int64       | 0.1     | [3] [4]           |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| float              | 0.1     | np.float64              | double      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| complex            | 0.1     | np.complex128           | double      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| str                | 0.1     | np.uint32/16            | char        | 0.1     | [5]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| bytes              | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | [6]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| bytearray          | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | [6]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| list               | 0.1     | np.object_              | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| tuple              | 0.1     | np.object_              | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| set                | 0.1     | np.object_              | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| frozenset          | 0.1     | np.object_              | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| cl.deque           | 0.1     | np.object_              | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| cl.ChainMap        | 0.2     | np.object_              | cell        | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| dict               | 0.1     |                         | struct      | 0.1     | [7]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| cl.OrderedDict     | 0.2     |                         | struct      | 0.2     | [7]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| cl.Counter         | 0.2     |                         | struct      | 0.2     | [7]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| slice              | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| range              | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| datetime.timedelta | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| datetime.timezone  | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| datetime.date      | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| datetime.time      | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| datetime.datetime  | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| fractions.Fraction | 0.2     |                         | struct      | 0.2     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.bool_           | 0.1     |                         | logical     | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.void            | 0.1     |                         |             |         |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.uint8           | 0.1     |                         | uint8       | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.uint16          | 0.1     |                         | uint16      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.uint32          | 0.1     |                         | uint32      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.uint64          | 0.1     |                         | uint64      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.int8            | 0.1     |                         | int8        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.int16           | 0.1     |                         | int16       | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.int32           | 0.1     |                         | int32       | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.int64           | 0.1     |                         | int64       | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.float16         | 0.1     |                         |             |         | [8]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.float32         | 0.1     |                         | single      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.float64         | 0.1     |                         | double      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.complex64       | 0.1     |                         | single      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.complex128      | 0.1     |                         | double      | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.str_            | 0.1     | np.uint32/16            | char/uint32 | 0.1     | [5]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.bytes_          | 0.1     | np.bytes_ or np.uint16  | char        | 0.1     | [6]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.object_         | 0.1     |                         | cell        | 0.1     |                   |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.ndarray         | 0.1     | see notes               | see notes   | 0.1     | [9] [10] [11]     |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.matrix          | 0.1     | see notes               | see notes   | 0.1     | [9] [12]          |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.chararray       | 0.1     | see notes               | see notes   | 0.1     | [9]               |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.recarray        | 0.1     | structured np.ndarray   | see notes   | 0.1     | [9] [10]          |
+--------------------+---------+-------------------------+-------------+---------+-------------------+
| np.dtype           | 0.2     | np.bytes_ or np.uint16  | char        | 0.2     | [6] [13]          |
+--------------------+---------+-------------------------+-------------+---------+-------------------+

.. [1] Depends on the selected options. Always np.uint8 when doing MATLAB compatibility, or if the option is explicitly set.
.. [2] In Python 2.x with the 0.1.x version of this package, it may be read back as a long if it can't fit in the size of an int.
.. [3] Stored as a np.int64 if it is small enough to fit. Otherwise its decimal string representation is stored as an np.bytes_ for hdf5storage >= 0.2 (error in earlier versions).
.. [4] Type found only in Python 2.x. Python 2.x's long and int are unified into a single int type in Python 3.x. Read as an int in Python 3.x.
.. [5] Depends on the selected options and whether it can be converted to UTF-16 without using doublets. If the option is explicitly set (or implicitly when doing MATLAB compatibility) and it can be converted to UTF-16 without losing any characters that can't be represented in UTF-16 or using UTF-16 doublets (MATLAB doesn't support them), then it is written as np.uint16 in UTF-16 encoding. Otherwise, it is stored as np.uint32 in UTF-32 encoding.
.. [6] Depends on the selected options. If the option is explicitly set (or implicitly when doing MATLAB compatibility), it will be stored as np.uint16 in UTF-16 encoding (unless it has non-ASCII characters, in which case a NotImplementedError is thrown). Otherwise, it is just written as np.bytes_.
.. [7] Stored either as each key-value as their own Dataset or as two Datasets, one for keys and one for values. The former is used if all keys can be converted to str and they don't have null characters ('\x00') or forward slashes ('/') in them. Otherwise, the latter format is used.
.. [8] np.float16 are not supported for h5py versions before 2.2. Version 3.3 or higher is required for this package since version 0.2.
.. [9] Container types are only supported if their underlying dtype is supported. Data conversions are done based on its dtype.
.. [10] Structured np.ndarrays (which have fields in their dtypes) can be written as an HDF5 COMPOUND type or as an HDF5 Group with Datasets holding its fields (either the values directly, or as an HDF5 Reference array to the values for the different elements of the data). Can only be written as an HDF5 COMPOUND type if none of its fields are of dtype 'object'. Field names cannot have null characters ('\x00') and, when writing as an HDF5 GROUP, forward slashes ('/') in them.
.. [11] Structured np.ndarrays with no elements, when written like a structure, will not be read back with the right dtypes for their fields (will all become 'object').
.. [12] Will be read back as a np.ndarray if the np.matrix class is removed.
.. [13] Stored in their string representation.
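The conversion rules in notes [3] and [5] can be sketched with numpy alone. This is an illustration of the rules as stated in the notes, not the package's implementation; the function names and the ``prefer_utf16`` parameter are hypothetical stand-ins for the package's options.

```python
import numpy as np


def int_for_storage(value: int):
    """Illustrative only: np.int64 if it fits, else the decimal string as bytes."""
    info = np.iinfo(np.int64)
    if info.min <= value <= info.max:
        return np.int64(value)
    return np.bytes_(str(value).encode("ascii"))


def str_for_storage(text: str, prefer_utf16: bool):
    """Illustrative only: UTF-16 code units when requested and no
    doublets (surrogate pairs) are needed, else UTF-32 code points."""
    if prefer_utf16 and all(ord(ch) <= 0xFFFF for ch in text):
        return np.frombuffer(text.encode("utf-16-le"), dtype="<u2")
    return np.frombuffer(text.encode("utf-32-le"), dtype="<u4")


small = int_for_storage(5)
big = int_for_storage(2**70)
plain = str_for_storage("hi", prefer_utf16=True)
astral = str_for_storage("a\U0001F600", prefer_utf16=True)

print(type(small).__name__, type(big).__name__)
print(plain.dtype.itemsize, astral.dtype.itemsize)
```

Characters above U+FFFF force the 32-bit path because they would need two UTF-16 code units each.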

This table gives the MATLAB classes that can be read from a MAT file, the first version of this package that can read them, and the Python type they are read as.

+-----------------+---------+-------------------------------------+
| MATLAB Class    | Version | Python Type                         |
+=================+=========+=====================================+
| logical         | 0.1     | np.bool_                            |
+-----------------+---------+-------------------------------------+
| single          | 0.1     | np.float32 or np.complex64 [14]     |
+-----------------+---------+-------------------------------------+
| double          | 0.1     | np.float64 or np.complex128 [14]    |
+-----------------+---------+-------------------------------------+
| uint8           | 0.1     | np.uint8                            |
+-----------------+---------+-------------------------------------+
| uint16          | 0.1     | np.uint16                           |
+-----------------+---------+-------------------------------------+
| uint32          | 0.1     | np.uint32                           |
+-----------------+---------+-------------------------------------+
| uint64          | 0.1     | np.uint64                           |
+-----------------+---------+-------------------------------------+
| int8            | 0.1     | np.int8                             |
+-----------------+---------+-------------------------------------+
| int16           | 0.1     | np.int16                            |
+-----------------+---------+-------------------------------------+
| int32           | 0.1     | np.int32                            |
+-----------------+---------+-------------------------------------+
| int64           | 0.1     | np.int64                            |
+-----------------+---------+-------------------------------------+
| char            | 0.1     | np.str_                             |
+-----------------+---------+-------------------------------------+
| struct          | 0.1     | structured np.ndarray or dict [15]  |
+-----------------+---------+-------------------------------------+
| cell            | 0.1     | np.object_                          |
+-----------------+---------+-------------------------------------+
| canonical empty | 0.1     | np.float64([])                      |
+-----------------+---------+-------------------------------------+

.. [14] Depends on whether there is a complex part or not.
.. [15] Controlled by an option.

File Incompatibilities

The storage of empty numpy.ndarray objects (or objects that would be stored like one) when the Options.store_shape_for_empty option is set (it is set implicitly when Matlab compatibility is enabled) is incompatible with the main branch of this package before 2021-07-11 as well as all 0.1.x versions of this package, since they have a bug (Issue #114). The incompatibility is caused by those versions storing the array shape in the Dataset after reversing the dimension order instead of before, meaning that the array is read with its dimensions reversed from what is expected if read after the bug fix or by Matlab.
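The effect of the ordering can be shown with a toy model in numpy. This is not the package's code: the three functions are hypothetical and only model the order of "record the shape" and "reverse the dimensions" described above.

```python
import numpy as np

# An empty array: its shape is the only information there is to store.
arr = np.zeros((2, 0, 3))


def record_shape_fixed(a):
    """Toy model of the fixed behaviour: record the shape, then reverse dimensions."""
    shape = a.shape
    _ = a.T  # dimension reversal happens after the shape was recorded
    return np.array(shape, dtype=np.uint64)


def record_shape_buggy(a):
    """Toy model of the bug: reverse dimensions first, then record the shape."""
    a = a.T
    return np.array(a.shape, dtype=np.uint64)


def read_empty(stored):
    """Toy reader that trusts the recorded shape."""
    return np.zeros(tuple(int(n) for n in stored))


print(read_empty(record_shape_fixed(arr)).shape)
print(read_empty(record_shape_buggy(arr)).shape)
```

A reader written for the fixed convention therefore reconstructs an array with reversed dimensions from a file produced by an affected version.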

Versions

0.2. Feature release adding/changing the following, including some API breaking changes.

0.1.19. Bugfix release.

0.1.18. Performance improving release.

0.1.17. Bugfix and deprecation workaround release that fixed the following.

0.1.16. Bugfix release that fixed the following bugs.

0.1.15. Bugfix release that fixed the following bugs.

0.1.14. Bugfix release that also added a couple features.

0.1.13. Bugfix release fixing the following bug.

0.1.12. Bugfix release fixing the following bugs. In addition, copyright years were also updated and notices put in the Matlab files used for testing.

0.1.11. Bugfix release fixing the following.

0.1.10. Minor feature/performance fix release doing the following.

0.1.9. Bugfix and minor feature release doing the following.

0.1.8. Bugfix release fixing the following two bugs.

0.1.7. Bugfix release with an added compatibility option and some added test code. Did the following.

0.1.6. Bugfix release fixing a bug with determining the maximum size of a Python 2.x int on a 32-bit system.

0.1.5. Bugfix release fixing the following bug.

0.1.4. Bugfix release fixing the following bugs. Thanks go to mrdomino <https://github.com/mrdomino>_ for writing the bug fixes.

0.1.3. Bugfix release fixing the following bug.

0.1.2. Bugfix release fixing the following bugs.

0.1.1. Bugfix release fixing the following bugs.

0.1. Initial version.