NeurodataWithoutBorders / matnwb

A Matlab interface for reading and writing NWB files
BSD 2-Clause "Simplified" License
49 stars 32 forks source link

[Bug]: Oddly long load times for no discernible reason. #567

Open RR-N opened 1 month ago

RR-N commented 1 month ago

What happened?

I have a number of different NWB files that were created with the same code. Of these, a particular dataset takes an oddly long time to load in, but only in matnwb -- pynwb loads these just fine. For comparison here and here are two datasets that were generated with the same code. I don't think filesize is a factor here, but FWIW, this dataset takes three to four times longer to load than this one or this one, which are both much bigger. May be worth noting that the conversion code was written in python, and that I've confirmed with other people on other systems that this only happens in matnwb.

I ran the MATLAB profiler on the nwbRead function to try and figure out what was going on -- figure that might prove helpful, so attached it here.

Profiler.pdf

Steps to Reproduce

tic
nwbRead(test_file)
toc

Error Message

No response

Operating System

Windows

Matlab Version


MATLAB Version: 23.2.0.2515942 (R2023b) Update 7 MATLAB License Number: 40702059 Operating System: Microsoft Windows 10 Pro Version 10.0 (Build 19045) Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode

MATLAB Version 23.2 (R2023b) License 40702059 Simulink Version 23.2 (R2023b) License 40702059 Bioinformatics Toolbox Version 23.2 (R2023b) License 40702059 Computer Vision Toolbox Version 23.2 (R2023b) License 40702059 Data Acquisition Toolbox Version 23.2 (R2023b) License 40702059 Deep Learning HDL Toolbox Version 23.2 (R2023b) License 40702059 Deep Learning Toolbox Version 23.2 (R2023b) License 40702059 GPU Coder Version 23.2 (R2023b) License 40702059 Image Acquisition Toolbox Version 23.2 (R2023b) License 40702059 Image Processing Toolbox Version 23.2 (R2023b) License 40702059 MATLAB Coder Version 23.2 (R2023b) License 40702059 MATLAB Compiler Version 23.2 (R2023b) License 40702059 MATLAB Compiler SDK Version 23.2 (R2023b) License 40702059 MATLAB Test Version 23.2 (R2023b) License 40702059 Mapping Toolbox Version 23.2 (R2023b) License 40702059 Medical Imaging Toolbox Version 23.2 (R2023b) License 40702059 Optimization Toolbox Version 23.2 (R2023b) License 40702059 Parallel Computing Toolbox Version 23.2 (R2023b) License 40702059 Partial Differential Equation Toolbox Version 23.2 (R2023b) License 40702059 Reinforcement Learning Toolbox Version 23.2 (R2023b) License 40702059 Signal Processing Toolbox Version 23.2 (R2023b) License 40702059 SimBiology Version 23.2 (R2023b) License 40702059 Simulink Coder Version 23.2 (R2023b) License 40702059 Simulink Compiler Version 23.2 (R2023b) License 40702059 Statistics and Machine Learning Toolbox Version 23.2 (R2023b) License 40702059 Symbolic Math Toolbox Version 23.2 (R2023b) License 40702059 System Identification Toolbox Version 23.2 (R2023b) License 40702059

Code of Conduct

rly commented 1 month ago

This is related to https://github.com/NeurodataWithoutBorders/pynwb/issues/1889 but specific to matnwb. I think the slowness has to do with the number of groups in the file. @lawrence-mbf do you have any insights?

lawrence-mbf commented 1 month ago

The profile.pdf is nice but unfortunately it seems like the slowness doesn't just come from the MatNWB implementation itself, but the internal hdf5 lib provided by MATLAB. I will have to benchmark this locally and find the places that these hdf5 libs are used.

lawrence-mbf commented 1 month ago

Okay it's like I suspected: The issue is the slowness of H5R.get_name which is painfully slow (it takes 0.11 seconds per call and it is called 1925 times in the example file). Unfortunately, afaik this is the only way to retrieve the full HDF5 path name from the raw reference data which is important for the ObjectView and RegionView objects to search through the in-memory NwbFile references. Does h5py do something different with references that avoids this problem?

rly commented 1 month ago

Are all references resolved on initial read? Can they be deferred?

lawrence-mbf commented 1 month ago

You could cache both the ref data as well as a path to the file and do what types.untyped.DataStub does. That would require a bit of breakage in the classes themselves but may not be that big of a change.