Closed seankmartin closed 1 year ago
Hi @seankmartin thanks for providing the script. I tried to reproduce your results and was able to see a noticeable difference in load times but not as extreme as what you found.
Here's what I got when I checked the load time difference between using NWBHDF5IO
vs get_ecephys_session
:
NWB loader took 2.16 seconds
Allen SDK loader took 14.41 seconds
To answer your question, when we use get_ecephys_session
we are extracting the data and turning it into a useful form. For example, pynwb
stores detected units as a pynwb-specific abstraction, called pynwb.misc.Units
. To turn it into the familiar pandas dataframe, we have to call nwbfile.units.to_dataframe()
. This is done here. This takes time since there is a lot of data to unpack. This has to be done for each probe in the dataset. This is just one example where we have to do some work to put the data stored by pynwb into a useful form.
I am surprised that there is such is a big difference between my runtime and yours. It could be due to difference in hardware?
For my local machine:
os.cpu_count()
12
If I run it on a machine with 32 cpus, it takes about half the time.
Thank you very much @aamster that's very helpful, and much appreciated!
In terms of hardware: I am on a 4 core machine locally, and google colab has 2 cores. Based on what you said though about the conversions performed to grab the data from the NWB, I figured it might be disk read speed related. So I moved the sample data across from my old hard drive over to a newer SSD, and I get a significant benefit from that 😄 -
NWB loader took 3.07 seconds
Allen SDK loader took 22.06 seconds
Hi,
The
cache.get_ecephys_session(id)
is quite slow for me on visual behavior neuropixels data that has already been downloaded (~1min), and takes about 20 times longer than a pynwb command to load the data viaNWBHDF5IO
.I was wondering if there was something that might be causing this slowdown due to how I am using the function perhaps? Or is it simply the case that the AllenSDK performs a lot of background calculations after loading the dataset from NWB to add extra information that is time consuming?
Further context
I have tried it on my own system with predownloaded data (Windows 10, python 3.8.10, allensdk 2.14.1, pynwb 2.2.0) - NWB took about 3seconds, AllenSDK about 1minute. I also tried on google colab with allensdk 2.14.1 on data that was mounted from my google drive - NWB took about 3seconds, AllenSDK about 30seconds.
I have included a code snippet that I used to test the times.
And the output of timing: