JiangXL opened this issue 4 years ago
I don't know if this component is still actively developed. You should also consider alternatives such as BDV-HDF5 or n5, maybe.
For what it's worth, here's a Groovy script I once created to read large HDF5 files in chunks of 2GB:
#@ File (label = "HDF5 File Input", style = "extensions:h5/hdf5") h5file
#@ Boolean (label = "Automatic Chunk Size") autoChunkSize
#@ String (visibility = MESSAGE, persist = false, value = "If checked, the following value is ignored") msg
#@ Integer (label = "Chunk Size (number of time points)", min = 1, value = 1000) chunk
#@output imgs
#@ LogService log
import ch.systemsx.cisd.hdf5.HDF5Factory
import ch.systemsx.cisd.hdf5.HDF5DataClass
import ij.ImagePlus
import ij.process.ShortProcessor
import net.imglib2.img.array.ArrayImgs
reader = HDF5Factory.openForReading(h5file)
info = reader.getDataSetInformation("/images")
log.info("Dataset found: $info")
// Make sure we have uint16
assert(!info.getTypeInformation().isSigned()) // u
assert(info.getTypeInformation().getDataClass() == HDF5DataClass.INTEGER) // int
assert(info.getTypeInformation().getElementSize() == 2) // 16
// Make sure we have 3 dimensions (tyx)
dims = info.getDimensions()
assert(dims.length == 3)
// Automatically determine the optimal chunk size:
// the number of time points whose uint16 data (2 bytes/pixel) fits into 2 GiB
final twoGiga = 2L * 1024 * 1024 * 1024
optimalChunkSize = twoGiga / (16/8) / dims[2] / dims[1]
log.info("Optimal chunk size: $optimalChunkSize")
if (autoChunkSize) {
    chunk = optimalChunkSize as int
}
log.info("Using chunk size $chunk")
numberOfChunks = Math.ceil(dims[0] / (chunk as double)) as int // round up so a partial last chunk is included, without adding an empty one
log.info("Creating $numberOfChunks chunks in total")
imgs = []
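// Read the dataset block by block: readMDArrayBlock takes the block extent
// (in elements) and the block index per axis, so block index `index` along the
// first (time) axis starts at element offset index * chunk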
numberOfChunks.times { index ->
    log.info("Reading chunk ${index+1}")
    shortArray = reader.uint16().readMDArrayBlock("/images", [chunk, dims[1], dims[2]] as int[], [index, 0, 0] as long[])
    // Create ArrayImg from MDShortArray
    aDims = []
    shortArray.dimensions().each { d ->
        aDims << d
    }
    imgs << ArrayImgs.unsignedShorts(shortArray.getAsFlatArray(), aDims.reverse() as long[])
}
// Close HDF5 File
reader.close()
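(Side note, not part of the original script: if you need the chunks as a single image again, they can be viewed as one volume concatenated along the time axis, assuming your ImgLib2 version provides Views.concatenate:)
import net.imglib2.RandomAccessibleInterval
import net.imglib2.view.Views
// each chunk is stored as [x, y, t], so axis 2 is the time axis
combined = Views.concatenate(2, imgs as RandomAccessibleInterval[])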
Thanks for your code! For now, I'm using BigTIFF for data analysis (in Julia) and visualization (in Fiji). But HDF5 is still better and natively supported in many programming environments.
The limit is not actually 2GB, it is 2G array elements. This plugin can load HDF5 files that are 32-bit floats with dimensions 1024x1024x2047, which is nearly 8 GB. But it cannot load files with dimensions 1024x1024x2048 of any data type, i.e. 8-bit, 16-bit, or 32-bit.
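A quick Groovy check of those numbers (the hard limit comes from Java arrays being indexed with int, i.e. at most 2^31 - 1 elements per array):
println 1024L * 1024 * 2047   // 2146435072 -> fits in a single Java array
println 1024L * 1024 * 2048   // 2147483648 == 2^31, one element too many
println Integer.MAX_VALUE     // 2147483647 -> maximum Java array length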
Note that the HDF5 plugin linked below will open large HDF5 datasets fine if the virtual stack option is selected in the dialog box: https://github.com/paulscherrerinstitute/ch.psi.imagej.hdf5
However, for many applications virtual stacks are not what is needed because they are read-only. For example, I have an 8 GB signed-integer HDF5 dataset, so I need to read it into a real stack and apply a calibration so the display correctly shows the signed values. This works fine when I read the data from a netCDF-3 file, but the native Java HDF5 reader plugin fails when the number of array elements is 2^31 or greater. This is a serious and rather silly limitation these days, when 128 GB of RAM costs less than $1,000.
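For reference, here is a minimal Groovy sketch of that kind of calibration workaround (my own illustration, not the plugin's code), assuming the signed 16-bit data was loaded as unsigned shorts offset by +32768, which is how ImageJ represents signed 16-bit internally:
import ij.IJ
import ij.measure.Calibration

imp = IJ.getImage()
cal = imp.getCalibration()
// STRAIGHT_LINE means displayed = a + b * stored; here displayed = stored - 32768
cal.setFunction(Calibration.STRAIGHT_LINE, [-32768d, 1d] as double[], "gray value")
imp.setCalibration(cal)
imp.updateAndDraw()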
We can and should fix this by employing imglib2. BigDataViewer and/or n5-viewer can read large HDF5 datasets without difficulty. This is on my todo list after having updated JHDF5 to 19.04.01.
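For anyone who needs this today, here is a rough sketch of that ImgLib2 route via n5-hdf5 (the file path, dataset name, and block size are placeholders, and it assumes a recent n5-hdf5 that accepts a file path directly; this is an illustration, not this plugin's API):
import org.janelia.saalfeldlab.n5.hdf5.N5HDF5Reader
import org.janelia.saalfeldlab.n5.imglib2.N5Utils
import net.imglib2.img.display.imagej.ImageJFunctions

// open the HDF5 file through the N5 API; the block size is only used
// if the dataset itself is not chunked
n5 = new N5HDF5Reader("/path/to/data.h5", 64, 64, 64)
// returns a lazily-loaded, cached cell image, so no single Java array
// ever has to hold more than one block
img = N5Utils.open(n5, "/images")
ImageJFunctions.show(img, "images")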
@MarkRivers if you have a few minutes for a chat, could you contact me at kittisopikulm@janelia.hhmi.org.
Dear team,
Could I ask for support for loading datasets larger than 2 GB? HDF5 is usually chosen precisely for much larger data, and it performs better than a TIFF stack.
Thanks!!!