KamitaniLab / bdpy

Python package for brain decoding analysis (BrainDecoderToolbox2 data format, machine learning analysis, functional MRI)

Idea for improving the speed of Features().get() #73

Closed. ganow closed this issue 9 months ago.

ganow commented 10 months ago

Motivation

Currently, if the number of features to load is large, the following code takes a long time:

feature_name = ...
stimulus_name = ...
features_store = Features("/path/to/features")
features = features_store.get(feature_name, label=stimulus_name)  # suppose len(stimulus_name) is large

This is because dataform.Features loads the per-label .mat files sequentially:

https://github.com/KamitaniLab/bdpy/blob/c01a4069d0906d7beb43084b625507649c80e5a9/bdpy/dataform/features.py#L195-L204
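For reference, the linked lines boil down to a sequential per-label loop of the following shape (a paraphrase of the pattern, not the verbatim source; labels, layer, and feature_file_table are illustrative names):

import numpy as np
import scipy.io as sio
import hdf5storage

# One loadmat call per label, executed strictly in order.
features = []
for lb in labels:
    path = feature_file_table[layer][lb]
    try:
        feat = sio.loadmat(path)['feat']
    except NotImplementedError:
        # MATLAB v7.3 files are HDF5-based; scipy.io raises
        # NotImplementedError on them, so fall back to hdf5storage.
        feat = hdf5storage.loadmat(path)['feat']
    features.append(feat)
features = np.concatenate(features, axis=0)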

We want to improve the speed of data loading for many stimuli.

Idea

We can parallelize the file loading with multiprocessing. For example:

import numpy as np
import scipy.io as sio
import hdf5storage
from multiprocessing import Pool

def _load_data(path):
    # Defined at module level so multiprocessing can pickle it for workers.
    try:
        return sio.loadmat(path)['feat']
    except NotImplementedError:
        # scipy.io cannot read MATLAB v7.3 (HDF5-based) files.
        return hdf5storage.loadmat(path)['feat']

class Features:
    def get(self, layer=None, label=None, n_parallel=16):
        ...
        path_iterator = map(lambda label: self.__feature_file_table[layer][label], labels)
        with Pool(processes=n_parallel) as pool:
            features = np.concatenate(pool.map(_load_data, path_iterator), axis=0)
        ...
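One design note: _load_data has to stay at module level, because multiprocessing pickles the function to send it to worker processes; a method or nested function would fail with a pickling error. If process startup and pickling overhead ever become a concern, a thread pool is a possible alternative, sketched below under the assumption that loading is dominated by disk I/O (get_threaded and its arguments are illustrative names, not part of bdpy):

from concurrent.futures import ThreadPoolExecutor

import numpy as np

def get_threaded(paths, n_parallel=16):
    # Threads avoid spawning processes and pickling results back;
    # whether they beat processes depends on how much time is spent
    # outside the GIL (disk I/O, HDF5/NumPy C code).
    with ThreadPoolExecutor(max_workers=n_parallel) as executor:
        arrays = list(executor.map(_load_data, paths))
    return np.concatenate(arrays, axis=0)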

Quick experiment

The original implementation took ~20 seconds to load 128 stimuli in my environment. The proposed implementation took ~3 seconds with n_parallel=16, roughly a 7x speedup.
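For anyone who wants to reproduce the numbers, a minimal timing harness along these lines should be enough (the path and the two ... placeholders need to be filled in):

import time

feature_name = ...
stimulus_name = ...
features_store = Features("/path/to/features")

start = time.perf_counter()
features = features_store.get(feature_name, label=stimulus_name)
elapsed = time.perf_counter() - start
print(f"loaded {len(stimulus_name)} stimuli in {elapsed:.1f} s")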