AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

loading behavior_ophys_experiment is very slow #2306

Closed · matchings closed this issue 2 years ago

matchings commented 2 years ago

Describe the bug Recently, loading a behavior_ophys_experiment object using the code below has become extremely slow. It used to take a few seconds; now it can take on the order of minutes. This is true both when loading datasets from LIMS and from a saved S3 cache of NWB files, but it is especially slow when using the .from_lims() method.

To Reproduce Here is an example of loading a Visual Behavior dataset from a saved S3 cache. On my machine, this took 21.1 seconds.

```python
from allensdk.brain_observatory.behavior.behavior_project_cache import VisualBehaviorOphysProjectCache

cache_dir = r'\\allen\programs\braintv\workgroups\nc-ophys\visual_behavior\platform_paper_cache'
cache = VisualBehaviorOphysProjectCache.from_s3_cache(cache_dir=cache_dir)
experiments_table = cache.get_ophys_experiment_table()
ophys_experiment_id = experiments_table.index[0]
dataset = cache.get_behavior_ophys_experiment(ophys_experiment_id)
```


Here is an example of loading a recent learning mFISH project dataset from LIMS. On my machine, this took 3 minutes 56 seconds.

```python
from allensdk.brain_observatory.behavior.behavior_project_cache import VisualBehaviorOphysProjectCache

cache = VisualBehaviorOphysProjectCache.from_lims()
experiments_table = cache.get_ophys_experiment_table(passed_only=False)
ophys_experiment_ids = experiments_table[experiments_table.project_code == 'LearningmFISHTask1A'].index.values
ophys_experiment_id = ophys_experiment_ids[10]
dataset = cache.get_behavior_ophys_experiment(ophys_experiment_id)
```


Expected behavior Loading a dataset should be fast.

Environment (please complete the following information): AllenSDK 2.13.3

Additional context This makes analysis of multiple experiments painfully slow.

Do you want to work on this issue? Happy to help troubleshoot

danielsf commented 2 years ago

@matchings can you try loading the NWB file "by hand" and see if it is also slower?

Isilon is undergoing some heavy maintenance this quarter. All I/O from /allen is suffering (try working with ophys movies every day...)

I suspect this will also impact from_lims() since most of those methods have to find and load pickle and sync files from Isilon.

Anyway: if you can do a quick check to see if this is actually a hardware issue, it would be greatly appreciated.

matchings commented 2 years ago

@danielsf I'd be happy to test this, but I'm not sure what you mean by "by hand". Using pynwb?

danielsf commented 2 years ago

@matchings Yes, or however else you would do it without the SDK. I just want to get a sense of whether our software or our hardware is the problem. If it's the latter, then any attempt to read the NWB file, even without using the SDK, should be slower than expected.
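Something along these lines should be enough for the check (untested, and the path is just a placeholder for one of the cached NWB files):

```python
import time

from pynwb import NWBHDF5IO

# Placeholder path -- point this at one of the NWB files from the S3 cache.
nwb_path = r'\\allen\...\behavior_ophys_experiment_XXXXXXXXX.nwb'

start = time.time()
with NWBHDF5IO(nwb_path, mode='r', load_namespaces=True) as io:
    nwbfile = io.read()
    # Touch something in the file so the read isn't purely lazy; adjust the
    # keys to whatever print(nwbfile.processing) shows for your file.
    print(list(nwbfile.processing.keys()))
print(f"pynwb read took {time.time() - start:.1f} s")
```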

matchings commented 2 years ago

@danielsf yeah, I have no idea how to load NWB files without the SDK, and I don't really have time at the moment to figure that out, so unless someone has some example code for me to follow, I think that particular test might have to wait.

Another option would be for me to move a few NWB files somewhere other than /allen and see if they load faster with the SDK. That's probably easier, so I will try that as soon as I get a chance.

It seems pretty likely that the Isilon maintenance could be the cause (I did not know about this work before, so thanks for informing me), especially if you are dealing with the same thing (I can't even imagine trying to work with ophys movies under these conditions; I'm sure that is very frustrating).

aamster commented 2 years ago

Hi @matchings, I looked into this. I pulled the largest NWB file in the VBO release (experiment 877018118) and profiled how long it takes to load. It took 70 seconds, 52 of which were spent in tolist(). We call an expensive tolist() on the traces array (in this case 155 million values) to convert it into a list so that it can become a column in a dataframe. I believe you requested that the traces dataframe contain a column with the dff values. I recommend instead either returning the raw array, or a dataframe in which each column is a timestep in the trace vector for a given ROI, i.e. pd.DataFrame(traces) instead of pd.DataFrame({'dff': traces.tolist()}). This would greatly speed up loading this data.
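For a rough sense of the difference, here is a sketch with random data standing in for the dff traces (scaled down about 10x from the real experiment so it runs comfortably in memory; exact timings will vary):

```python
import time

import numpy as np
import pandas as pd

# Stand-in for the dff traces: 150 ROIs x 100k timesteps of random data
# (the real experiment has ~155 million values in total).
traces = np.random.rand(150, 100_000)

start = time.time()
df_current = pd.DataFrame({'dff': traces.tolist()})  # current approach: one Python list per ROI
print(f"pd.DataFrame({{'dff': traces.tolist()}}): {time.time() - start:.2f} s")

start = time.time()
df_proposed = pd.DataFrame(traces)                   # proposed: one column per timestep
print(f"pd.DataFrame(traces):                    {time.time() - start:.2f} s")
```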

matchings commented 2 years ago

Hey @aamster, I don’t think changing the structure of the traces data is the best idea as it will likely break analysis code for internal and external users who are expecting a specific format.

@mattjdavis also did some profiling recently and found some other steps that were causing things to slow down, including the stimulus dewarping which is a much less critical piece of data. Maybe the two of you can discuss and come up with other options?

aamster commented 2 years ago

Since 75% of the time is spent in tolist(), I don't think it makes sense to work on improving efficiency in any other part of the code; it would not give us much of a speedup and may be time-consuming to improve. I'm marking this as won't fix unless we can remove tolist().

mattjdavis commented 2 years ago

Thanks for the clarification, Adam. I was looking at loading datasets through the SDK directly via from_lims, while it looks like the expensive tolist() has to do with loading from NWB files. The two loading strategies may have different reasons for being slow?

Here is my hack for from_lims loading: creating a flag to skip the stimulus unwarping step (which we generally don't need for every experiment; it could probably be precalculated and stored elsewhere for each stimulus, or loaded at the session level) results in a 5x speedup in loading: https://github.com/AllenInstitute/AllenSDK/commit/a1805bc610866ab34e9aec0c520e212b9a8d4cb1

@matchings do we care to speed up loading via NWB?

aamster commented 2 years ago

@mattjdavis we convert the traces to a Python list both when reading from NWB and from LIMS. However, when we load from LIMS we use a list comprehension, while when loading from NWB we use tolist(). It turns out that the list comprehension is much faster than tolist(). It is still slower than not converting to a list at all, but it is a middle ground. The load time of the NWB file dropped from 70 to 10 seconds when I made this change. I think this should address much of the "slowness" when loading experiments from NWB.
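A quick sketch of that gap, again with random data standing in for the traces array (numbers will vary by machine):

```python
import time

import numpy as np

traces = np.random.rand(150, 100_000)    # stand-in for the dff traces

start = time.time()
as_nested_lists = traces.tolist()        # converts every element to a Python float
print(f"tolist():           {time.time() - start:.3f} s")

start = time.time()
as_arrays = [row for row in traces]      # just collects row views, no element conversion
print(f"list comprehension: {time.time() - start:.3f} s")
```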

When loading from NWB we don't unwarp the image; we just load it from the file. But when loading from LIMS we do unwarp. Yeah we could potentially do something there to either precompute or load it at the session level as you said. It would still take ~100 seconds to load the largest released VBO experiment from LIMS even without the unwarping.

matchings commented 2 years ago

I am more concerned about the speed of the .from_lims() method. As I described in the original issue post, loading data using .from_lims() is an order of magnitude slower than using .from_s3_cache().

@mattjdavis how long is it taking to load an experiment using .from_lims() based on the more recent profiling you did?

matchings commented 2 years ago

@aamster I am fine with switching the NWB loading from tolist() to a list comprehension as long as the data structure itself doesn't change.

aamster commented 2 years ago

@matchings tolist() returns a list of lists, while the list comprehension returns a list of numpy arrays. Currently from_nwb returns a list of lists while from_lims returns a list of numpy arrays.
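On a tiny example the two structures look like this:

```python
import numpy as np

traces = np.arange(6.0).reshape(2, 3)

nwb_style = traces.tolist()            # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
lims_style = [row for row in traces]   # [array([0., 1., 2.]), array([3., 4., 5.])]

print(type(nwb_style[0]))   # <class 'list'>
print(type(lims_style[0]))  # <class 'numpy.ndarray'>
```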

mattjdavis commented 2 years ago

@matchings here is what I see for loading a LAMF mouse, 10 different sessions. I generally haven't seen the ~4 minute loading time you previously observed. I did this for BehaviorSession and OphysExperiment objects (the latter are all image stimulus experiments).

[plot: BehaviorSession load times]

[plot: OphysExperiment load times]

When I use the cache strategy (rather than directly loading the objects as I have done above), things take around twice as long (1 min 20 secs), e.g. dataset = cache.get_behavior_ophys_experiment(ophys_experiment_id). I don't know when one would prefer to use the cache; I also realize there is another issue floating around addressing cache memory issues.

mattjdavis commented 2 years ago

Also, I ran the code you posted in the opening of the issue (which uses ophys_experiment_id 1166453355) and got ~30 seconds. When I directly load the object for that id (BehaviorOphysExperiment.from_lims(1166453355)) I get about 10 seconds.
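For reference, those two timings come from calls along these lines (the BehaviorOphysExperiment import path is from the SDK version I'm on; adjust if it has moved):

```python
import time

from allensdk.brain_observatory.behavior.behavior_ophys_experiment import BehaviorOphysExperiment
from allensdk.brain_observatory.behavior.behavior_project_cache import VisualBehaviorOphysProjectCache

ophys_experiment_id = 1166453355

# Direct load, no cache layer (~10 s for me)
start = time.time()
experiment = BehaviorOphysExperiment.from_lims(ophys_experiment_id)
print(f"BehaviorOphysExperiment.from_lims:   {time.time() - start:.1f} s")

# Through the project cache (~30 s for me)
start = time.time()
cache = VisualBehaviorOphysProjectCache.from_lims()
dataset = cache.get_behavior_ophys_experiment(ophys_experiment_id)
print(f"cache.get_behavior_ophys_experiment: {time.time() - start:.1f} s")
```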

aamster commented 2 years ago

The most expensive part of loading via from_lims is reading the trace array from disk. As Scott suggested in his original response to the slow load time, it appears /allen can be 100 times slower than reading the same file stored outside of /allen. If I load traces from /allen it can take 50 seconds, while on /local1 it consistently takes 0.6 seconds. If we can fix this, it will address much of the slow load time from LIMS. I've reached out to IT for an explanation.
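The comparison was essentially timing a full read of the same traces file from the two locations, something like the sketch below (the paths and dataset key are placeholders, not the actual LIMS layout):

```python
import time

import h5py

def time_full_read(path: str, key: str = 'data') -> float:
    """Read an entire HDF5 dataset into memory and return the elapsed seconds."""
    start = time.time()
    with h5py.File(path, 'r') as f:
        _ = f[key][()]  # [()] forces a full read rather than a lazy handle
    return time.time() - start

# Placeholder paths -- the same dff traces file copied to the two drives.
print('/allen :', time_full_read('/allen/path/to/dff_traces.h5'))
print('/local1:', time_full_read('/local1/copy/of/dff_traces.h5'))
```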

matchings commented 2 years ago

This is great, thank you so much @aamster and @mattjdavis !!

matchings commented 2 years ago

@aamster any word from IT regarding why loading from /allen is so slow?

aamster commented 2 years ago

@matchings I found out a few things about /allen. It is a network drive and so has network latency. It has a theoretical max bandwidth of 125 MB/sec, which to me seems slow, and that is only the theoretical max. One day after a file is created, it gets moved to a slower drive. I tested this by reading a file that has been on /allen for more than a day vs. a fresh copy of that same file (which counts as new); reading the newer file was much faster. There is also complicated caching, which I don't fully understand, that affects read times. I was surprised to learn that reading a file from /allen can be up to 100x slower than reading from a local drive.
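As a back-of-the-envelope check against the ~155 million trace values profiled above (assuming float64 on disk):

```python
# Even at the theoretical max bandwidth, just moving the bytes takes ~10 seconds.
n_values = 155_000_000                      # from the profiling earlier in this thread
bytes_per_value = 8                         # assuming float64 on disk
size_mb = n_values * bytes_per_value / 1e6  # ~1240 MB

max_bandwidth_mb_per_s = 125                # 1 gigabit link
print(f"{size_mb:.0f} MB / {max_bandwidth_mb_per_s} MB/s = "
      f"{size_mb / max_bandwidth_mb_per_s:.0f} s best case")
```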

matchings commented 2 years ago

@aamster so I take it there's just nothing we can do about it, other than moving all the data off /allen, which would mean basically undoing LIMS?

aamster commented 2 years ago

@matchings correct.

matchings commented 2 years ago

Bummer

mattjdavis commented 2 years ago

10 gigabit + fiber install at the institute? But very $$$…


matchings commented 2 years ago

You’d think we would have something like that already…