AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

Slow loading of Visual Behaviour Neuropixels data #2657

Closed: seankmartin closed this 1 year ago

seankmartin commented 1 year ago

Hi,

Calling cache.get_ecephys_session(id) is quite slow for me on Visual Behavior Neuropixels data that has already been downloaded (about 1 minute), roughly 20 times longer than loading the same file with pynwb via NWBHDF5IO.

Is something in how I am using the function causing this slowdown? Or is it simply that the AllenSDK performs a lot of time-consuming processing after loading the dataset from NWB to add extra information?

Further context

I have tried it on my own system with pre-downloaded data (Windows 10, Python 3.8.10, allensdk 2.14.1, pynwb 2.2.0): NWB took about 3 seconds, AllenSDK about 1 minute. I also tried on Google Colab with allensdk 2.14.1 on data mounted from my Google Drive: NWB took about 3 seconds, AllenSDK about 30 seconds.

I have included the code snippet I used to measure the times.

from time import perf_counter
from pathlib import Path

from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorNeuropixelsProjectCache,
)
from pynwb import NWBHDF5IO

# Local cache directory containing the pre-downloaded data and manifest
ALLEN_DIR = Path(r"D:\example-data\allen-data")
ALLEN_MANIFEST = "visual-behavior-neuropixels_project_manifest_v0.4.0.json"
EXAMPLE_SESSION = 1044385384

def timeit(name):
    # Simple decorator that reports how long the wrapped call takes
    def inner(func):
        def wrapper(*args, **kwargs):
            t1 = perf_counter()
            result = func(*args, **kwargs)
            t2 = perf_counter()

            print(f"{name} took {t2 - t1:.2f} seconds")
            return result

        return wrapper

    return inner

@timeit("Allen SDK loader")
def allen_example(cache):
    return cache.get_ecephys_session(EXAMPLE_SESSION)

@timeit("NWB loader")
def nwb_example(nwb_path):
    nwb_io = NWBHDF5IO(nwb_path, "r", load_namespaces=True)
    return nwb_io.read()

def main(cache_is_s3=True):
    if cache_is_s3:
        cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(cache_dir=ALLEN_DIR)
    else:
        cache = VisualBehaviorNeuropixelsProjectCache.from_local_cache(
            cache_dir=ALLEN_DIR
        )
    cache.load_manifest(ALLEN_MANIFEST)

    nwb_path = (
        ALLEN_DIR
        / "visual-behavior-neuropixels-0.4.0"
        / "behavior_ecephys_sessions"
        / str(EXAMPLE_SESSION)
        / f"ecephys_session_{EXAMPLE_SESSION}.nwb"
    )

    # Time each loader twice, alternating the order
    nwb_example(nwb_path)
    allen_example(cache)

    allen_example(cache)
    nwb_example(nwb_path)

main(cache_is_s3=True)
main(cache_is_s3=False)

And the timing output:

NWB loader took 2.76 seconds
Allen SDK loader took 56.84 seconds
Allen SDK loader took 50.09 seconds
NWB loader took 2.80 seconds
NWB loader took 2.36 seconds
Allen SDK loader took 51.56 seconds
Allen SDK loader took 56.49 seconds
NWB loader took 7.52 seconds
aamster commented 1 year ago

Hi @seankmartin, thanks for providing the script. I tried to reproduce your results and saw a noticeable difference in load times, but not as extreme as what you found.

Here's what I got when I checked the load time difference between using NWBHDF5IO vs get_ecephys_session:

NWB loader took 2.16 seconds
Allen SDK loader took 14.41 seconds

To answer your question: when we use get_ecephys_session we are extracting the data and turning it into a more useful form. For example, pynwb stores detected units as a pynwb-specific abstraction called pynwb.misc.Units. To turn it into a familiar pandas dataframe, we have to call nwbfile.units.to_dataframe(). This is done here. It takes time because there is a lot of data to unpack, and it has to be done for each probe in the dataset. This is just one example of the work needed to put the data stored by pynwb into a useful form.
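
For illustration, that unpacking for units alone looks roughly like this when done directly with pynwb (a minimal sketch; it assumes the session file from the script above and that the sorted units live at nwbfile.units):

from pynwb import NWBHDF5IO

with NWBHDF5IO("ecephys_session_1044385384.nwb", "r", load_namespaces=True) as io:
    nwbfile = io.read()
    # pynwb exposes sorted units as a pynwb.misc.Units table; converting it to a
    # pandas DataFrame materializes every column (spike times, waveforms, ...),
    # which is where much of the extra time goes
    units_df = nwbfile.units.to_dataframe()
    print(units_df.shape)

get_ecephys_session does this kind of conversion, plus similar work for other parts of the file, which is why it costs more than the bare NWBHDF5IO read.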

I am surprised that there is such a big difference between my runtime and yours. Could it be due to a difference in hardware?

For my local machine:

os.cpu_count()
12

If I run it on a machine with 32 CPUs, it takes about half the time.

seankmartin commented 1 year ago

Thank you very much @aamster that's very helpful, and much appreciated!

In terms of hardware: I am on a 4-core machine locally, and Google Colab has 2 cores. Based on what you said about the conversions performed to pull the data out of the NWB file, though, I figured the slowdown might be related to disk read speed. So I moved the sample data across from my old hard drive to a newer SSD, and I get a significant benefit from that 😄 -

NWB loader took 3.07 seconds
Allen SDK loader took 22.06 seconds
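
In case it is useful to others, a rough way to compare raw read throughput between the two drives (a sketch only, reusing the path constants from the script above; note a repeat run may be served from the OS file cache):

from pathlib import Path
from time import perf_counter

ALLEN_DIR = Path(r"D:\example-data\allen-data")
EXAMPLE_SESSION = 1044385384
nwb_path = (
    ALLEN_DIR
    / "visual-behavior-neuropixels-0.4.0"
    / "behavior_ecephys_sessions"
    / str(EXAMPLE_SESSION)
    / f"ecephys_session_{EXAMPLE_SESSION}.nwb"
)

t1 = perf_counter()
n_bytes = 0
with open(nwb_path, "rb") as f:
    # Stream the file in 64 MB chunks so it is never held in memory all at once
    while chunk := f.read(64 * 1024 * 1024):
        n_bytes += len(chunk)
t2 = perf_counter()
print(f"Read {n_bytes / 1e9:.2f} GB in {t2 - t1:.2f} s "
      f"({n_bytes / 1e9 / (t2 - t1):.2f} GB/s)")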