bendichter opened 4 months ago
@magland I'm trying to put together an example of a search over brain regions for electrophysiology. This works, but it takes about 15 minutes for me on the IBL dataset (to be fair, it's a very big dataset). Is there a faster way, or is this what you would recommend?
import lindi
from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient

brain_area = "MB"
dandi_api_client = DandiAPIClient()
dandiset_id = "000409"
dandiset = dandi_api_client.get_dandiset(dandiset_id)
elec_loc_path = 'general/extracellular_ephys/electrodes/location'

# persist downloaded lindi files on disk so repeated runs skip the download
local_cache = lindi.LocalCache(cache_dir='lindi_cache')

passing_assets = []
for asset in tqdm(list(dandiset.get_assets())):
    if not asset.path.endswith(".nwb"):
        continue
    lindi_url = f'https://lindi.neurosift.org/dandi/dandisets/{dandiset_id}/assets/{asset.identifier}/nwb.lindi.json'
    lindi_file = lindi.LindiH5pyFile.from_lindi_file(lindi_url, local_cache=local_cache)
    if elec_loc_path in lindi_file and brain_area in lindi_file[elec_loc_path]:
        passing_assets.append(asset)
For comparison, this version using remfile and h5py directly took 18:30:
import h5py
import remfile
from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient

brain_area = "V3"
dandi_api_client = DandiAPIClient()
dandiset_id = "000409"
dandiset = dandi_api_client.get_dandiset(dandiset_id)
elec_loc_path = 'general/extracellular_ephys/electrodes/location'

passing_assets = []
for asset in tqdm(list(dandiset.get_assets())):
    if not asset.path.endswith(".nwb"):
        continue
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
    rem_file = remfile.File(s3_url)
    h5_file = h5py.File(rem_file, "r")
    if elec_loc_path in h5_file:
        # h5py returns bytes for string datasets; decode with .asstr()
        # so the comparison against a str works
        locations = h5_file[elec_loc_path].asstr()[:]
        if brain_area in locations:
            passing_assets.append(asset)
Actually, I am surprised there isn't a bigger time difference
Yeah, it is surprising that the lindi method is not faster. The only part that takes any time is the download: each file is only 1-2 MB, which should take less than a second per file, though it will depend on the network connection.
I tried it on a GitHub Codespaces instance and it took 506 seconds (8:26) for 677 files (115 passing).
I then tried the remfile method -- I didn't run it to completion, but it was going at around 2.2 seconds per file. So slower, but not hugely.
Where lindi would provide a much bigger advantage, I am speculating, is when you are loading more information per file -- for example, when using pynwb, which needs to load a lot more metadata. See the sketch below.
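Here is an untested sketch of what I mean (the asset id in the URL is a placeholder, not a real one). pynwb builds the whole NWB object hierarchy on read, so it makes many small metadata reads, which is exactly where the inlined data in the lindi JSON should help:

import lindi
import pynwb

# placeholder asset id -- substitute a real one from the dandiset
lindi_url = 'https://lindi.neurosift.org/dandi/dandisets/000409/assets/<asset-id>/nwb.lindi.json'

f = lindi.LindiH5pyFile.from_lindi_file(lindi_url)
with pynwb.NWBHDF5IO(file=f, mode='r') as io:
    nwbfile = io.read()  # this triggers many small metadata reads
    print(nwbfile.electrodes.to_dataframe()['location'].unique())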
Another possibility is to prepare a single large lindi file for the entire dandiset. Then it could be loaded much more efficiently -- see the sketch below.
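A rough sketch of that idea, assuming you simply collect the per-asset reference file systems into one local JSON index (dandiset_index.json is a name I made up, not something lindi produces):

import json
import requests
from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient

dandiset_id = "000409"
dandiset = DandiAPIClient().get_dandiset(dandiset_id)

# fetch each asset's nwb.lindi.json once and store all of them in a
# single local index keyed by asset id
index = {}
for asset in tqdm(list(dandiset.get_assets())):
    if not asset.path.endswith(".nwb"):
        continue
    lindi_url = f'https://lindi.neurosift.org/dandi/dandisets/{dandiset_id}/assets/{asset.identifier}/nwb.lindi.json'
    index[asset.identifier] = requests.get(lindi_url).json()

with open('dandiset_index.json', 'w') as f:
    json.dump(index, f)

A later search then loads dandiset_index.json once and, if I remember right, can open each entry with lindi.LindiH5pyFile.from_reference_file_system(index[asset_id]), with no per-file network round trips (except for any large chunks that were not inlined).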
Yet another possibility is to cache the lindi files locally, as shown below.
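That is what the local_cache argument in the first script above is doing; spelled out (lindi_cache is just an arbitrary directory name):

import lindi

# a persistent directory; the first run downloads each nwb.lindi.json,
# subsequent runs read it straight from disk
local_cache = lindi.LocalCache(cache_dir='lindi_cache')
# lindi_url as in the per-asset URLs in the scripts above
f = lindi.LindiH5pyFile.from_lindi_file(lindi_url, local_cache=local_cache)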
At NeuroDataReHack, one of the students wanted to identify sessions within the IBL dataset that contained electrodes in a specific brain region. That's doable with the DANDI API, remfile, and pynwb, but it can take a very long time because it requires streaming and initializing each NWB file. I think it would be much faster to do it with LINDI, particularly since the metadata they needed is stored inline in the .lindi.json file as base64. It would be great if we had a tutorial that demonstrated how to use LINDI in this way -- a sketch of what its core could look like follows. I think it could reduce search time substantially and would be a cool use-case for LINDI.
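As a possible starting point for that tutorial, the search from the scripts above could be wrapped up like this (find_sessions_with_brain_area is a name I made up):

import lindi
from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient

def find_sessions_with_brain_area(dandiset_id: str, brain_area: str):
    """Return the assets in a dandiset whose electrodes table contains brain_area."""
    elec_loc_path = 'general/extracellular_ephys/electrodes/location'
    dandiset = DandiAPIClient().get_dandiset(dandiset_id)
    # cache the downloaded lindi files so repeated searches are fast
    local_cache = lindi.LocalCache(cache_dir='lindi_cache')
    passing = []
    for asset in tqdm(list(dandiset.get_assets())):
        if not asset.path.endswith(".nwb"):
            continue
        url = f'https://lindi.neurosift.org/dandi/dandisets/{dandiset_id}/assets/{asset.identifier}/nwb.lindi.json'
        f = lindi.LindiH5pyFile.from_lindi_file(url, local_cache=local_cache)
        if elec_loc_path in f and brain_area in f[elec_loc_path]:
            passing.append(asset)
    return passing

assets = find_sessions_with_brain_area("000409", "MB")
print([a.path for a in assets])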