ercius / openNCEM

A collection of packages and tools for electron microscopy data analysis supported by the National Center for Electron Microscopy facility of the Molecular Foundry.
GNU General Public License v3.0

EMD `memmap`s can lose their file handles prematurely #31

Closed sezelt closed 2 years ago

sezelt commented 2 years ago

An issue we've encountered in py4DSTEM is that if you load a memory map of an EMD dataset using ncempy.io.emd.fileEMD.get_memmap(N), the handle to the HDF5 file is forcibly closed when the fileEMD object gets garbage collected.

If a fileEMD object gets created inside a function (and is not retained anywhere else), when that function exits the fileEMD will be deleted, calling its __del__ method and closing the file handle: https://github.com/ercius/openNCEM/blob/ab0a40a1354cf7e258efd016de675d2844e7a721/ncempy/io/emd.py#L162-L168
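To see the mechanism in isolation, here is a toy stand-in (not ncempy code) for an object that closes a handle in __del__. In CPython, when a function returns, a local whose last reference dies is finalized immediately, so __del__ runs before the caller ever sees the return value:

```python
class FileLike:
    """Toy stand-in for fileEMD: 'closes' its handle in __del__."""

    def __init__(self):
        self.handle = {"open": True}

    def __del__(self):
        # Mimics fileEMD closing the underlying HDF5 file on finalization.
        self.handle["open"] = False


def load():
    f = FileLike()
    return f.handle  # f's last reference dies here, so __del__ fires


h = load()
print(h["open"])  # False: the handle was 'closed' the moment load() returned
```

This is exactly the shape of the bug: the returned object (here a dict, in ncempy an h5py Dataset) outlives the owner whose __del__ tears down the resource it depends on.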

A minimal reproduction of the problem is like so:

from ncempy.io.emd import fileEMD

fpath = "Sample_LFP_v0,12.h5"

def load_memmap(fpath, N):
    f = fileEMD(fpath)
    return f.get_memmap(N)  # f goes out of scope here; __del__ closes the file

data, dims = load_memmap(fpath, 0)

print(data)
# <Closed HDF5 dataset>

In py4DSTEM we use a context manager for most reading, so my approach when a memory map is requested is to create a new h5py.File outside of the context manager. While this File object will get garbage collected, that does not actually close the file handle, because there is still a reference to the Dataset. A good question with my approach is whether the file handle ever gets closed; it's possible that when the Dataset is finally garbage collected the file gets closed as well, but I haven't tested this.
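A minimal sketch of that approach with plain h5py (the file path and dataset name below are made up for the example): the returned Dataset holds a reference to its File, so the underlying HDF5 file stays open even after the File object itself is garbage collected.

```python
import os
import tempfile

import h5py
import numpy as np


def get_dataset(path, name):
    # Deliberately no `with` block: we want the handle to outlive this
    # function. The returned Dataset keeps a reference to the File, so
    # garbage-collecting the File object does not close the file.
    f = h5py.File(path, "r")
    return f[name]


# Build a small demo file (names here are invented for the example).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(10.0))

dset = get_dataset(path, "data")
print(dset[:3])             # still readable after get_dataset() returned
print(bool(dset.id.valid))  # True: the underlying file is still open
```

Contrast this with the fileEMD case above: here nothing with a __del__ forcibly closes the file, so the Dataset alone is enough to keep it alive.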

I am using ncempy version 1.8.1

ercius commented 2 years ago

Thanks for bringing this up! I can see how this would be a problem.

I think your approach is a good solution, and I'll adopt it in an upcoming ncempy release. I don't like the idea that the file object is lost, but the h5py documentation explicitly mentions this solution at this link. So, going out of scope with so-called 'weak' closing is a suggested way to deal with HDF5 memory maps.

This is a tricky problem, actually. NumPy just totally ignores the issue! See the numpy documentation here:

> Currently there is no API to close the underlying mmap. It is tricky to ensure the resource is actually closed, since it may be shared between different memmap instances.
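As a concrete illustration of NumPy's stance (the file path below is invented for the example): np.memmap exposes no public close(), so dropping every reference is the only supported way to release the mapping.

```python
import os
import tempfile

import numpy as np

# Make a small binary file to map.
path = os.path.join(tempfile.mkdtemp(), "arr.dat")
np.arange(5, dtype=np.float64).tofile(path)

m = np.memmap(path, dtype=np.float64, mode="r", shape=(5,))
print(m[:3])

# There is no public close() method on np.memmap; deleting the last
# reference (or letting it go out of scope) is the documented way to
# release the underlying mapping.
del m
```

So NumPy and h5py end up in the same place: the lifetime of the OS-level resource is tied to Python garbage collection rather than an explicit close.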

Alternatively, a user could keep the fileEMD object in scope by returning the file object along with the memmap. Then, when you are done with the data, you can del the file object and the dataset object. Or just let them all go out of scope together, so everything is garbage collected at the same time, hopefully closing the file properly. I don't think this is the best option in py4DSTEM, since you just want to return the data by itself.

Here is an example of what I suggested above, in case anyone runs into a similar issue.

from ncempy.io.emd import fileEMD

fpath = "Sample_LFP_v0,12.h5"

def load_memmap(fpath, N):
    f = fileEMD(fpath)
    # Parentheses are required to unpack into a return tuple on Python < 3.8
    return (f, *f.get_memmap(N))

f, data, dims = load_memmap(fpath, 0)

print(data)
# <HDF5 dataset "data": shape (10, 10), type "<f8">

del f, data, dims  # release everything together so the file closes cleanly