flatironinstitute / neurosift

Browser-based NWB visualization and DANDI exploration
Apache License 2.0

ElectricalSeries: important chunking considerations #52

Closed magland closed 7 months ago

magland commented 1 year ago

This is more of a discussion than an issue.

In the Buzsaki dataset from 2021 (YutaMouse33-150222), ElectricalSeries data loads very slowly in neurosift, especially for large numbers of channels. I looked into the chunk size and got the following:

https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced
Opening file for lazy reading...
Getting data object
Shape of data: (1220698000, 96)
Chunking: (298022, 1)
Getting first 10 rows of data
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 21.24241590499878 sec

Compare that with a much more recent dataset (000463 draft)

https://dandiarchive.s3.amazonaws.com/blobs/082/8f8/0828f847-62e9-443f-8241-3960985ddab3
Opening file for lazy reading...
Getting data object
Shape of data: (31021056, 32)
Chunking: None
Getting first 10 rows of data
Shape of data chunk: (10, 32)
Elapsed time for reading data chunk: 0.000518798828125 sec

It appears that in the second case, chunking is turned off, and it makes a huge difference for remote reading of data (well, there must be some caching going on here).

You can also see a dramatic difference in loading time in neurosift:

The slow first one (with chunking): https://flatironinstitute.github.io/neurosift/#/nwb?url=https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced Takes 10 seconds to load the initial view

The fast second one (with no chunking): https://flatironinstitute.github.io/neurosift/#/nwb?url=https://dandiarchive.s3.amazonaws.com/blobs/082/8f8/0828f847-62e9-443f-8241-3960985ddab3 Takes less than 1 second to load the initial view

Even if chunking is needed for very large datasets, I think that the chunks should be much larger in both the channel and time dimensions.

@bendichter

CodyCBakerPhD commented 1 year ago

I don't think it's just chunking that's at play here then. Chunking is a necessary step for enabling compression, which is likely what is actually causing the large slowdown compared to the uncompressed (and un-chunked) data

To confirm we would need to find or upload some example files that are chunked, but not compressed. I doubt any of those exist on the archive so I'll just make one and upload to staging

Note: it is, however, best practice to always use compression when storing data on the archive, and the NWB Inspector should warn people about it (though DANDI validation will not outright prevent upload or publishing). Otherwise the archive would be much larger than it is today

CodyCBakerPhD commented 1 year ago

BTW, out of curiosity, is there any parallelization going on right now w.r.t. data access calls? While decompression isn't super heavy computationally, it might be beneficial to have multiple chunks requested and decompressed simultaneously, but I'm not familiar enough with how TypeScript/JavaScript interfaces with things like shared memory

magland commented 1 year ago

@CodyCBakerPhD to confirm that the first is using compression and the second is not (I added a couple of fields in the Python output):

First (slow):

https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced
Opening file for lazy reading...
Getting data object
Shape of data: (1220698000, 96)
Chunking: (298022, 1)
Compression: gzip
Compression opts: 4
Getting first 10 rows of data
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 9.983946084976196 sec

Second (fast):

https://dandiarchive.s3.amazonaws.com/blobs/082/8f8/0828f847-62e9-443f-8241-3960985ddab3
Opening file for lazy reading...
Getting data object
Shape of data: (31021056, 32)
Chunking: None
Compression: None
Compression opts: None
Getting first 10 rows of data
Shape of data chunk: (10, 32)
Elapsed time for reading data chunk: 0.0004906654357910156 sec

Do you think there's a way (in the Python code) to test how much time compression is taking, vs network requests? I suppose one way is to download the file and try it locally -- but the first file is very large so that's inconvenient.

For reference, this is the Python script I am using

import time
import numpy as np
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

# nwb_object_id = 'c86cdfba-e1af-45a7-8dfd-d243adc20ced'
nwb_object_id = '0828f847-62e9-443f-8241-3960985ddab3'
s3_url = f'https://dandiarchive.s3.amazonaws.com/blobs/{nwb_object_id[:3]}/{nwb_object_id[3:6]}/{nwb_object_id}'
print(s3_url)

fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # cache_storage="nwb-cache",  # Local folder for the cache
)

print('Opening file for lazy reading...')
f = fs.open(s3_url, "rb")
file = h5py.File(f)
print('Getting data object')
x = file['acquisition']['ElectricalSeries']['data']
print(f'Shape of data: {x.shape}')
print(f'Chunking: {x.chunks}')
print(f'Compression: {x.compression}')
print(f'Compression opts: {x.compression_opts}')
print('Getting first 10 rows of data')
timer = time.time()
data_chunk = x[:10, :]
print(f'Shape of data chunk: {data_chunk.shape}')
elapsed_sec = time.time() - timer
print(f'Elapsed time for reading data chunk: {elapsed_sec} sec')

Regarding parallelization in the browser, all the HDF5 access takes place in a worker thread, and that worker thread makes synchronous HTTP requests and does synchronous gunzip decompression (I believe that's a requirement for using h5wasm), so there is no concurrency within a single worker. However, I am using 2 worker threads, so that allows some concurrency at a higher level.

Is it the network latency or the decompression? - I'm just not sure.

Tagging @garrettmflynn

CodyCBakerPhD commented 1 year ago

Rigorous performance testing is a very nuanced topic; this is something we want to really sink our teeth into pending a grant we've applied for related to cloud computing, but since we haven't started that yet I'll keep things as general as I can to avoid making incorrect claims about specifics

In general we follow the recommendations of the HDF5 Group on optimizing I/O, but I'll admit that white paper was put together quite a while ago and was likely not tuned specifically for our S3 situation

Do you think there's a way (in the Python code) to test how much time compression is taking, vs network requests?

The most sophisticated way would be to do a full profile across the underlying h5py (decompression) / fsspec (streaming calls) call trace, but those reports are often quite hard to read and require extremely in-depth knowledge of those packages - my favorite tool for this is cProfile
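
For illustration, a minimal cProfile sketch of that kind of trace might look something like the following (the URL and dataset path just reuse the script above; grouping the hot spots into h5py vs. fsspec frames is left to reading the report):

import cProfile
import pstats
import fsspec
import h5py

def read_first_rows(s3_url):
    # open the remote file lazily and read a small slice, same as the timing script above
    f = fsspec.filesystem("http").open(s3_url, "rb")
    file = h5py.File(f)
    return file['acquisition']['ElectricalSeries']['data'][:10, :]

s3_url = 'https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced'
cProfile.run('read_first_rows(s3_url)', 'read_profile.stats')

# sort by cumulative time to see whether h5py (decompression) or fsspec/HTTP (streaming) frames dominate
pstats.Stats('read_profile.stats').sort_stats('cumulative').print_stats(30)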

The easiest way would be what I'm preparing for you now, which is to use Python timing similar to how you're currently doing it (with one difference elaborated below) but applied to several copies of the same data. Across these copies you have the same exact data written

(i) chunked and compressed
(ii) chunked identically to (i) but not compressed
(iii) not chunked or compressed at all

and you then test and compare timings of the same exact access patterns across each separate copy
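
For illustration, a rough sketch of how those three copies could be written with plain h5py (hypothetical file names and a small stand-in array; the real files are of course written through PyNWB/H5DataIO with the actual ElectricalSeries data):

import numpy as np
import h5py

data = np.random.randint(-1000, 1000, size=(400_000, 96), dtype='int16')  # stand-in data

# (i) chunked and compressed, matching the source file's options
with h5py.File('case_i_chunked_compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(298022, 1), compression='gzip', compression_opts=4)

# (ii) chunked identically to (i), but not compressed
with h5py.File('case_ii_chunked_uncompressed.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(298022, 1))

# (iii) not chunked or compressed at all (contiguous layout)
with h5py.File('case_iii_contiguous.h5', 'w') as f:
    f.create_dataset('data', data=data)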

The access pattern itself will also vary a lot between the 3 cases; the read pattern of (10, 96) you show above would of course work much better on any unchunked dataset, because you're requesting random access to a 1.92 KB piece that would otherwise be a subset of a 0.6 MB chunk. And while I'm not 100% positive that the entire chunk is returned when you access a data region from case (i) or (ii), I suspect that may be the case.

So I'd also recommend testing against a number of read patterns that subset the chunk shape, equal the chunk shape, and span multiple chunks, to get fairer comparisons

Technically the type and/or options of compression can also make a big difference, which was the topic of a summer project by the LBNL team last summer, led by @rly. I'll set that aside for now since it gets even more complicated than this simpler discussion

For reference, this is the Python script I am using

Thanks for sharing the code - I'm going to point you towards this section of a blog about using timeit in Python, but feel free to check out the rest if interested

The short version is that timing things with time.time() is not always accurate, since it captures global wall-clock time, not just the time the particular operation took; timeit also lets you repeat the speed test some number of times so you can average the results to reduce stochasticity (though remember to set unique cache locations each time to avoid subsequent runs pulling from the cache instead of re-testing the streaming time)
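
A rough sketch of that timeit approach, using a throwaway cache directory per run so that no repetition benefits from an earlier download (the repeat count and cache handling here are just placeholders):

import timeit
import tempfile
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

s3_url = 'https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced'

def read_slice():
    # fresh cache directory each run so repeats re-test the streaming, not the local disk
    fs = CachingFileSystem(fs=fsspec.filesystem("http"), cache_storage=tempfile.mkdtemp())
    with fs.open(s3_url, "rb") as f:
        x = h5py.File(f)['acquisition']['ElectricalSeries']['data']
        return x[:10, :]

# run the read a few times and report each repetition's wall time
print(timeit.repeat(read_slice, number=1, repeat=3))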

Is it the network latency or the decompression? - I'm just not sure.

There's also a potential I/O bottleneck if there's any caching to disk involved (I see your example code is using a CachingFileSystem with fsspec) - that could especially come into play if more CPUs were thrown at the requests (and the bandwidth was confirmed to not be the current bottleneck)

CodyCBakerPhD commented 1 year ago

From @bendichter

Paging (aka sharding) may help with this but needs to be tested

Sure I can make a copy of each (i-iii) that have paging set (assuming I can get it to pass via H5DataIO)

magland commented 1 year ago

Thanks for doing this @CodyCBakerPhD. I think it will be important to put the datasets in a cloud bucket to test remote access. Also, they should be large enough to prevent an artificially high level of cache hits when downloading chunks of the file (sorry if that is not expressed clearly)

CodyCBakerPhD commented 1 year ago

Also, they should be large enough to prevent artificially high level of cache hits when downloading chunks of the file (sorry if that is not expressed clearly)

Yeah I'm copying the entire ElectricalSeries data from https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced as a simpler TimeSeries, still in acquisition, with the different cases illustrated as the file names. Will let you know and send you links when they're ready

CodyCBakerPhD commented 1 year ago

@magland The 2 new examples (chunked but not compressed, not chunked and not compressed) should be up by morning at https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=, in a folder called 'for_jeremy'

Two fs_strategy=page versions of each of the 3 file types will be uploaded once I wake up tomorrow - one will have fs_page_size = 4 KiB (the default) and the other will have fs_page_size = 4 MiB, just to see if that makes any difference. File names should indicate everything of importance
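
For reference, pagination is a file-creation-time setting; a minimal h5py sketch of what these paged versions amount to (hypothetical file names, small stand-in dataset):

import numpy as np
import h5py

data = np.zeros((100_000, 96), dtype='int16')  # stand-in data

# the 'page' file space strategy packs metadata and raw data into fixed-size pages,
# so related metadata can be pulled down in a small number of page-sized requests
with h5py.File('paged_4KiB.h5', 'w', fs_strategy='page', fs_page_size=4 * 1024) as f:
    f.create_dataset('data', data=data)

with h5py.File('paged_4MiB.h5', 'w', fs_strategy='page', fs_page_size=4 * 1024 * 1024) as f:
    f.create_dataset('data', data=data)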

Where applicable, all chunking and compression parameters are equivalent to the source data

magland commented 1 year ago

Thanks @CodyCBakerPhD !

Here are some timing results. But a couple notes first

import time
import numpy as np
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

## see https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=for_jeremy
s3_urls = {
    'chunked_but_not_compressed': 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/738/069/73806960-88ff-4d5e-9920-40d43186cf26',
    'not_chunked_and_not_compressed': 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/0c1/51d/0c151db4-27db-4c2f-8290-0626a66526c3'
}

fs = CachingFileSystem(
    fs=fsspec.filesystem("http")
)

timeseries_objects = {}

for k, s3_url in s3_urls.items():
    print('====================================')
    print(k)
    print('Opening file for lazy reading...')
    f = fs.open(s3_url, "rb")
    file = h5py.File(f)
    print('Getting data object')
    x = file['acquisition']['TimeSeries']['data']
    print(f'Shape of data: {x.shape}')
    print(f'Chunking: {x.chunks}')
    print(f'Compression: {x.compression}')
    print(f'Compression opts: {x.compression_opts}')
    print('Getting first 10 rows of data')
    timeseries_objects[k] = x
# output
====================================
chunked_but_not_compressed
Opening file for lazy reading...
Getting data object
Shape of data: (1220698000, 96)
Chunking: (298022, 1)
Compression: None
Compression opts: None
Getting first 10 rows of data
====================================
not_chunked_and_not_compressed
Opening file for lazy reading...
Getting data object
Shape of data: (1220698000, 96)
Chunking: (707, 96) # <--------- still using chunking
Compression: None
Compression opts: None
Getting first 10 rows of data
for k, timeseries_object in timeseries_objects.items():
    print('====================================')
    print(f'Timing reading data for {k}')
    x = timeseries_object
    # timeit doesn't work here because the result is cached after the first run
    timer = time.time()
    data_chunk = x[:10, :]
    print(f'Shape of data chunk: {data_chunk.shape}')
    elapsed_sec = time.time() - timer
    print(f'Elapsed time for reading data chunk: {elapsed_sec} sec')
# output
====================================
Timing reading data for chunked_but_not_compressed
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 28.612600326538086 sec
====================================
Timing reading data for not_chunked_and_not_compressed # <----- I think it actually is chunked (see above)
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 7.263010501861572 sec

So there is a significant difference between the two schemes, even though both were not compressed.

Looking at my network monitor, it seems the difference is the amount of data transferred - and this will depend on the relationship between the block/caching scheme of the fsspec filesystem and the chunking of the timeseries in the HDF5 file.
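
As an aside, one way to poke at that relationship from Python is to vary the fsspec block size and watch how long the same read takes (the block sizes below are arbitrary guesses, not recommendations):

import time
import fsspec
import h5py

s3_url = 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/738/069/73806960-88ff-4d5e-9920-40d43186cf26'

for block_size in [256 * 1024, 1024 * 1024, 4 * 1024 * 1024]:
    # block_size controls how much each underlying HTTP range request fetches
    fs = fsspec.filesystem("http", block_size=block_size)
    with fs.open(s3_url, "rb") as f:
        x = h5py.File(f)['acquisition']['TimeSeries']['data']
        t0 = time.time()
        _ = x[:10, :]
        print(f'block_size={block_size}: {time.time() - t0:.2f} sec')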

This difference is consistent with what I am seeing in neurosift in the browser

chunked_but_not_compressed http://localhost:3000/neurosift/?p=/nwb&url=https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/738/069/73806960-88ff-4d5e-9920-40d43186cf26

not_chunked_and_not_compressed <------ I think it actually is chunked http://localhost:3000/neurosift/?p=/nwb&url=https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/0c1/51d/0c151db4-27db-4c2f-8290-0626a66526c3

CodyCBakerPhD commented 1 year ago

@magland The paginated versions of each combination are up (ignore the 'unchunked' ones of course, regenerating those now). Also includes a compressed + paginated version with same chunking as original data

It seems that the not_chunked_and_not_compressed actually does have chunking of (707, 96) (see output below)

Looks like I used the wrong approach to disable chunking - regenerating now

I tried to use %timeit magic command in jupyter notebook, but it was not suitable since after the first run it was essentially instantaneous for loading the chunk.

Yeah, as I mentioned above, you would need to remember to set unique cache locations each time to avoid subsequent runs pulling from the cache instead of re-testing the stream time. Or don't enable caching at all, since we're really trying to test the bandwidth speed vs. other factors

I still strongly recommend the timeit approach, with a proper setup command to avoid conflation with other setup delays

So there is a significant difference between the two schemes, even though both were not compressed.

Yes, with respect to the read access pattern (10, 96). I'd recommend trying a wider range of access patterns representing real life situations, such as zooming in on several localized waveforms (smaller range of channels, maybe ~50 frame slice or so), monitoring global variations over longer time scales (num_seconds * sampling_rate number of frames, all channels, where num_seconds > 5 or so), and anything else you can think of

Despite my mistake, this is actually useful. What happened was the write procedure defaulted to using the h5py auto-chunking scheme instead of entirely disabling chunking as intended. The default h5py scheme, despite the HDF5 team's recommendations, actually generates much smaller chunk sizes (~0.13 MB seen here, as opposed to the ~0.6 MB of the channel-wise chunked file)
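
Just to make those sizes concrete (assuming int16 samples at 2 bytes each):

# bytes per chunk = frames_per_chunk * channels_per_chunk * bytes_per_sample
bytes_per_sample = 2  # int16

channel_wise_chunk = 298022 * 1 * bytes_per_sample   # 596044 bytes, ~0.6 MB (original source file)
default_h5py_chunk = 707 * 96 * bytes_per_sample     # 135744 bytes, ~0.13 MB (h5py auto-chunking)
requested_read = 10 * 96 * bytes_per_sample          # 1920 bytes, ~1.9 KB (the (10, 96) slice)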

So the difference in speed reading uncompressed data is in line with my hypothesis that a data access request for any sub-slice of a chunk streams the entire chunk worth of data. I'm not saying this is absolute confirmation, but it could certainly explain the current observation

magland commented 1 year ago

Thanks! I agree that we'll want to look at the various read access patterns. For sanity (this is getting pretty complicated), I suggest we focus first on one single measure: how fast is the initial load of the traces in neurosift? For that question there is a very clear winner:

unchunked_and_uncompressed (which actually uses the default chunking from h5py) beats all the others by a long shot. In fact, right now, neurosift is set to only load the first 15 channels... and the difference is: waiting for around 15 seconds with a 10 MB data download on the one hand, versus waiting for around 1 second with less than 1 MB downloaded on the other. If I adjust the app to try to load ALL the channels, the former goes for more than 100 seconds before I closed the tab, whereas the latter takes just a few seconds.

So, before getting into other use cases, there is an overwhelming winner here from the perspective of initial load of a small amount of data (which is highest priority for me). The question remains whether it would still be performant if we added compression to the default h5py chunking.

Lazy loading via Python is a different consideration, which is important, but I'm hoping to focus first on the neurosift performance.

Also, the different pagination schemes don't seem to make an obvious difference - at least when loading into neurosift.

CodyCBakerPhD commented 1 year ago

Thanks! I agree that we'll want to look at the various read access patterns. For sanity (this is getting pretty complicated)

That's what I was trying to tell you lol 😆 this is a very deep rabbit hole and premature optimization can be detrimental to rapid dev momentum

Just constraining to rough neurosift app speeds for now is fine by me

I have now reuploaded (replaced) the filenames with properly unchunked versions, though I had to constrain it to the first ~100 seconds or so because any attempt to buffer the data write triggered chunking to be enabled - curious what you notice on the speeds when data is completely unchunked (though I notice I can't seem to control the second dimension of the TimeSeries view)

For the final round of fair use case tests I'll generate some with the default h5py chunking as well as modern NeuroConv chunking (differs from the old source data), both compressed and uncompressed (and name the files accordingly). Probably skip pagination for now on those

So, before getting into other use cases, there is an overwhelming winner here from the perspective of initial load of a small amount of data (which is highest priority for me).

The last thing I'll mention here is, if we accept my notion of the correlation between streaming speed and chunk size, then we are left with the following argument:

a) published data assets have their dataset options, including chunking, frozen and cannot be changed to optimize streaming speeds
b) we can make changes to NeuroConv and related default behaviors, which will produce new dandisets with new files with more optimized streaming behavior, but that will not happen overnight
c) while the unchunked examples are interesting to play with, I really doubt the DANDI team would be OK with doubling (or more) the size of the archive - compression of large data is likely non-negotiable, which then requires chunking, and we can only work to optimize the shape and sharding thereof
d) we both want neurosift to be able to visualize contents as efficiently as the source allows (given non-optimal chunking, as well as compression)

Towards the aim of (d) given the constraints of (a-c) I'm driven to the following conclusion: neurosift should attempt to work as best as it can around the chunk shape instead of being tuned to specific target ranges, even if those targets are more 'ideal' in general

What I mean by this is, if streaming requests transfer an entire chunk worth of data even if you didn't intend to use all the data in that chunk, the way to improve performance is for default rendering to make the most of what's been returned.

That is, instead of always having a default view that targets, say, (5 seconds, 15 channels) - which only works well when the source chunks are roughly (moderate amount of time, small number of channels), as opposed to the poor performance seen with source chunks of (small amount of time, all channels) or (large amount of time, 1 channel) - make the default view equal in shape to the first chunk, or the first few chunks that add up to some very small data threshold (such as ~5 MB), thus ensuring the user's bandwidth is being utilized as much as possible upon the first view of the data.
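
To make that concrete, a rough sketch of the kind of selection logic I have in mind (just illustrative Python; the ~5 MB budget and 2-byte samples are assumptions):

def default_view_shape(chunk_shape, dataset_shape, bytes_per_sample=2, budget_bytes=5 * 1024 * 1024):
    # pick an initial (num_frames, num_channels) view that is a whole number of chunks
    # along the time axis and stays under a small download budget, so that none of the
    # streamed bytes are wasted
    chunk_frames, chunk_channels = chunk_shape
    chunk_bytes = chunk_frames * chunk_channels * bytes_per_sample
    num_chunks = max(1, budget_bytes // chunk_bytes)  # whole chunks that fit in the budget
    view_frames = min(dataset_shape[0], chunk_frames * num_chunks)
    return (view_frames, min(dataset_shape[1], chunk_channels))

# source chunked channel-wise: the view follows one long single-channel strip
print(default_view_shape((298022, 1), (1220698000, 96)))  # (2384176, 1)
# default h5py chunking: the view spans all channels over a short time window
print(default_view_shape((707, 96), (1220698000, 96)))    # (26866, 96)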

What would you think of that approach?

Granted, if the user then selects different channel ranges, or scans/plays or zooms in/out over time, that's when we get into the broader discussion of access patterns, which can wait for another day - but there's still not too much we can do right away given the constraints (a-c)

magland commented 1 year ago

Thanks @CodyCBakerPhD, what you say makes sense. A lot of this is pretty mysterious to me. I'll follow up on Monday.

magland commented 1 year ago

Hi @CodyCBakerPhD

There's a lot to discuss here - to start, here are some Python timings based on the new data you uploaded and your suggestion to look at reading other sizes of data.

I ran each timing twice to get at least some idea of the variability

For reading (10, 96) segment:

====================================
Timing reading data for chunked_but_not_compressed
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 48.093024492263794 sec / 61.54763746261597 sec
====================================
Timing reading data for chunked_and_compressed_page_fs_strategy_4_KiB
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 30.916049242019653 sec / 31.682336568832397 sec
====================================
Timing reading data for default_h5py_chunking_not_compressed <--- this is no longer in the folder you shared
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 6.520519971847534 sec / 10.449405670166016 sec
====================================
Timing reading data for unchunked_and_uncompressed_page_fs_strategy_4_KiB
Shape of data chunk: (10, 96)
Elapsed time for reading data chunk: 0.00031256675720214844 sec / 0.0003273487091064453 sec

For reading (298022, 96) segment

====================================
Timing reading data for chunked_but_not_compressed
Shape of data chunk: (298022, 96)
Elapsed time for reading data chunk: 37.758856773376465 sec / 37.91836762428284 sec
====================================
Timing reading data for chunked_and_compressed_page_fs_strategy_4_KiB
Shape of data chunk: (298022, 96)
Elapsed time for reading data chunk: 19.3127863407135 sec / 25.656846284866333 sec
====================================
Timing reading data for default_h5py_chunking_not_compressed <--- this is no longer in the folder you shared
Shape of data chunk: (298022, 96)
Elapsed time for reading data chunk: 36.85222601890564 sec / 35.99728536605835 sec
====================================
Timing reading data for unchunked_and_uncompressed_page_fs_strategy_4_KiB
Shape of data chunk: (298022, 96)
Elapsed time for reading data chunk: 17.922171592712402 sec / 17.13180422782898 sec

I chose (298022, 96) so that the 298022 matches the chunking size for the first two examples.

bendichter commented 1 year ago

There's a great YouTube video on this: https://www.youtube.com/watch?v=rcS5vt-mKok
We can try to implement some of these features.

CodyCBakerPhD commented 1 year ago

First steps on this PR: https://github.com/hdmf-dev/hdmf/pull/925

Exposure of pagination settings (larger defaults than h5py has, and a larger page buffer as well) to follow

Partial chunk shaping given a size limit has been requested and will be given higher priority: https://github.com/catalystneuro/neuroconv/issues/17 (the ability to say 'chunk this SpikeGLX series by (None, 192) up to 5 MB' or 'chunk this 4D image by (1, None, None, None) up to 3 MB', etc., and it will fill the None axes as much as possible up to that size)
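
A rough sketch of the idea (not the NeuroConv implementation, just an illustration; the function name, itemsize, and target size are placeholders):

import math

def fill_chunk_shape(spec, full_shape, itemsize, target_bytes):
    # fill the None axes of a chunk spec (e.g. (None, 192)) as much as possible
    # while keeping the chunk at or under target_bytes
    fixed = math.prod(s for s in spec if s is not None)
    free_axes = [i for i, s in enumerate(spec) if s is None]
    budget = max(1, target_bytes // (itemsize * fixed))  # elements left for the free axes
    # split the remaining budget evenly across the free axes (at least 1 each)
    per_axis = max(1, int(budget ** (1 / len(free_axes))))
    return tuple(s if s is not None else min(full_shape[i], per_axis) for i, s in enumerate(spec))

# 'chunk this SpikeGLX series by (None, 192) up to 5 MB', with int16 samples
print(fill_chunk_shape((None, 192), (100_000_000, 384), itemsize=2, target_bytes=5 * 1024 * 1024))
# -> (13653, 192)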

magland commented 1 year ago

I hesitate to add yet another consideration here, but...

There is a tricky interplay between the emscripten lazy (virtual) file that h5wasm uses and the chunking and data layout in the remote HDF5 file. You need to specify a chunk size for the virtual file (completely different from the HDF5 dataset chunking). If it is too small, you end up with too many GET requests and a low transfer rate due to the latency of each request. If it is too large, you end up downloading a lot more data than is actually needed. For example, if the virtual file chunk size is 1 MB and you only need to read the attributes of an HDF5 group, then you waste nearly the entire 1 MB request. On the other hand, if the chunk size is 10 KB, then you need thousands of individual requests to download a 10 MB slice of a dataset. The problem is that, when implementing the virtual file (backed by the synchronous HTTP requests in the worker thread), you don't know how many bytes are being requested in any given read operation; you only get a request to read data starting at a certain file position. This makes it impossible to distinguish between the system requesting small or large blocks of data. And even if the virtual file chunk size were matched appropriately to the requested chunk size of the datasets, there's no reason to expect that the chunks would align efficiently.

Basically what I'm saying is that the h5wasm implementation is not as simple as requesting the needed data ranges in the h5 file. There is a virtual file chunking system at play, and it's not very flexible.

I have found that a 100 KB virtual file chunk size is about optimal for these datasets, so that's what I am using now.
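
To illustrate the trade-off (this is not the actual h5wasm/emscripten code, which lives in TypeScript; it's just the same idea sketched in Python with a hypothetical block-aligned reader), every read gets rounded out to whole virtual-file chunks fetched via HTTP range requests:

import requests

class BlockAlignedRemoteFile:
    # toy lazy file: reads are served from fixed-size blocks fetched with HTTP range requests
    def __init__(self, url, block_size=100 * 1024):
        self.url = url
        self.block_size = block_size
        self.blocks = {}  # block index -> bytes

    def _get_block(self, index):
        if index not in self.blocks:
            start = index * self.block_size
            end = start + self.block_size - 1
            # one GET per block: small blocks mean many requests, large blocks mean overfetch
            r = requests.get(self.url, headers={"Range": f"bytes={start}-{end}"})
            self.blocks[index] = r.content
        return self.blocks[index]

    def read(self, offset, length):
        # the real emscripten lazy file only sees the starting offset, not the length,
        # which is part of what makes tuning the block size so awkward
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        data = b"".join(self._get_block(i) for i in range(first, last + 1))
        skip = offset - first * self.block_size
        return data[skip:skip + length]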

One issue is that all the meta information (attributes of groups and datasets, plus the data for small datasets) needs to be read very rapidly on page load. This is not efficient if that metadata is scattered randomly throughout a very large NWB file. To remedy this, I have been precomputing a meta-only version of the NWB file and storing it on a separate server. That file usually ends up being 2-10 MB in size, so for it I use a larger virtual file chunk size (4 MB), and it loads all of the initial information very quickly.

For more info on how I am caching a meta-only version of the nwb files, see https://github.com/flatironinstitute/neurosift/issues/7

bendichter commented 1 year ago

@magland

One issue is that all the meta information (attributes of groups and datasets, plus the data for small datasets) needs to be read very rapidly on page load. This is not efficient if that metadata is scattered randomly throughout a very large NWB file. To remedy this, I have been precomputing a meta-only version of the NWB file and storing it on a separate server. That file usually ends up being 2-10 MB in size, so for it I use a larger virtual file chunk size (4 MB), and it loads all of the initial information very quickly.

It's fine to store intermediate files as a stop-gap, but ultimately we'd like to have default file settings that are optimized for cloud access so that these types of methods are unnecessary. We have been experimenting with packing h5 files so that all of this metadata is together and can be read in one or a few requests. Can you point me to a file that is giving you trouble in this category? I'll use that to test this new approach.

magland commented 1 year ago

@bendichter

It's fine to store intermediate files as a stop-gap, but ultimately we'd like to have default file settings that are optimized for cloud access so that these types of methods are unnecessary. We have been experimenting with packing h5 files so that all of this metadata is together and can be read in one or a few requests. Can you point me to a file that is giving you trouble in this category? I'll use that to test this new approach.

There are certainly more extreme examples, but here's one where there is a noticeable difference in the initial load time:

Without using my meta nwb (slow):

http://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced&no-meta=1

With using my meta nwb (fast):

http://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced&no-meta=0

(note the no-meta=1 query parameter)

For me the difference is 6 seconds versus <1 second.

magland commented 1 year ago

@bendichter @CodyCBakerPhD

I uploaded some ephys datasets to DANDI. Notice how fast this loads when I use no compression and no chunking.

https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/638/80a/63880a91-ca46-4ab8-ba07-bf4ab044145c&tab=neurodata-item:/acquisition/ElectricalSeries|ElectricalSeries

I would expect similar performance with large chunks (covering all channels) and no compression.

The next best option (IMO) would be large chunks (covering all channels) and compression (although this is not ideal for viewing small time segments, random access -- so the chunks should be around 5-10 MB IMO)

The least cloud-optimal option is small chunks (or chunks covering a small number of channels) with or without compression - or very large chunks with compression.

CodyCBakerPhD commented 1 year ago

The next best option (IMO) would be large chunks (covering all channels)

Even for NeuroPixels? By my calculations, a ~10MB chunk over all 384 channels would only cover about 1/3 of a second of equivalent real-time data.

I'm about to repackage some of the raw data from 409, which is multiprobe NP - what chunk shapes would you like me to try with that? I was thinking of trying at least

(54613, 96) (close to 2s)
(27307, 192) (closer to 1s)
(13653, 384)

magland commented 1 year ago

The next best option (IMO) would be large chunks (covering all channels)

Even for NeuroPixels? By my calculations, a ~10MB chunk over all 384 channels would only cover about 1/3 of a second of equivalent real-time data.

I'm about to repackage some of the raw data from 409, which is multiprobe NP - what chunk shapes would you like me to try with that? I was thinking of trying at least

(54613, 96) (close to 2s)
(27307, 192) (closer to 1s)
(13653, 384)

Oh yeah, I wasn't thinking of the large channel counts (I mainly work with 32 ch). I guess this depends on whether people will usually want to access all the channels or a subset. This is really difficult - I don't know a good way here. :)

CodyCBakerPhD commented 1 year ago

With 32 ch, how big of a time window would you ever want to look at?

magland commented 1 year ago

With 32 ch, how big of a time window would you ever want to look at?

Usually just a few seconds, but perhaps up to 30 seconds.

CodyCBakerPhD commented 1 year ago

Usually just a few seconds, but perhaps up to 30 seconds.

Thanks, that's helpful

I'll add

(81920, 64) (~2.7s)
(163840, 32) (~5.4s)
(327680, 16) (~11s)

to experiment with

magland commented 1 year ago

@bendichter Here's a more dramatic example highlighting the need for the meta nwb file

https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/fd9/602/fd9602fd-77d5-42ad-8ea9-08fdc9f08226&no-meta=1

that takes around 20 seconds to load... compared with just a second for this one:

https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/fd9/602/fd9602fd-77d5-42ad-8ea9-08fdc9f08226&no-meta=0

It's because of the 188 acquisition items

CodyCBakerPhD commented 1 year ago

that takes around 20 seconds to load

which file on DANDI is that? I can try repacking it really quick to see if paginated metadata solves the initial loading problem

magland commented 1 year ago

that takes around 20 seconds to load

which file on DANDI is that? I can try repacking it really quick to see if paginated metadata solves the initial loading problem

It's this: https://dandiarchive.org/dandiset/000615/draft/files?location=sub-001 / sub-001_ses-20220411.nwb

CodyCBakerPhD commented 1 year ago

@magland OK here is the example of that one paginated at 10 MiB: https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/c61/76f/c6176f66-c606-44eb-8d68-85ef71c3808d

located in https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=for_jeremy

Is that any faster for you?

One other thing is I do not know how or if h5wasm exposes page buffer size, which can also have a big impact on how these are read

The analog in h5py is the page_buf_size at this line: https://github.com/h5py/h5py/pull/1967/files#diff-6515aaa459980029aa5ae782298c82cacb68f8f6f3218a9f0419f7148bd80f30R357

and it must be set to a multiple of the page layout to work well; I'd recommend page_buf_size=20971520 here, if possible
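
For reference, on the Python side that looks something like this (assuming the file was written with the 10 MiB page size above; 20971520 is 2 pages' worth):

import fsspec
import h5py

s3_url = 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/c61/76f/c6176f66-c606-44eb-8d68-85ef71c3808d'

f = fsspec.filesystem("http").open(s3_url, "rb")
# page_buf_size keeps whole metadata pages in memory after the first read
file = h5py.File(f, 'r', page_buf_size=20971520)
print(list(file['acquisition'].keys()))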

magland commented 1 year ago

@CodyCBakerPhD That makes a huge difference!

Your paginated version loads in just a couple seconds for me https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/c61/76f/c6176f66-c606-44eb-8d68-85ef71c3808d&no-meta=1

(notice no-meta=1)

compared with this slow one

https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandiarchive.s3.amazonaws.com/blobs/fd9/602/fd9602fd-77d5-42ad-8ea9-08fdc9f08226&no-meta=1

The meta nwb method still seems to be a bit faster than the paginated one, but both are fast enough.

CodyCBakerPhD commented 1 year ago

That makes a huge difference!

Great to hear 👍 that alone is a pretty easy change to make aside from the chunking problem so we'll start incorporating that across the ecosystem

CodyCBakerPhD commented 1 year ago

The new round of NeuroPixel examples are up for qualitative evaluation

The data stream that was repacked was the ElectricalSeriesAP from https://dandiarchive.org/dandiset/000409/draft/files?location=sub-CSH-ZAD-001%2F

The naming convention is -chunking-{num_frames}-{num_channels} as outlined above

Let me know which you prefer so we can update that as the default for that and similar formats

magland commented 1 year ago

This is great, thank you @CodyCBakerPhD !

They all seem to provide a better neurosift experience than the previous default especially when num_channels per chunk >= 32. The ~10 MB chunks work well I think.

Here's a table showing how many chunks are needed for various reasonable loading patterns. The columns are the number of channels in a chunk.

| | 1 | 16 | 32 | 64 | 96 | 192 | 384 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1ch 1tp | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1ch 50ms | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1ch 1sec | 1 | 1 | 1 | 1 | 1 | 2 | 3 |
| 1ch 10sec | 1 | 1 | 1 | 1 | 1 | 2 | 3 |
| 32ch 1ms | 32 | 2 | 1 | 1 | 1 | 1 | 1 |
| 32ch 50ms | 32 | 2 | 1 | 1 | 1 | 1 | 1 |
| 32ch 1sec | 32 | 2 | 1 | 1 | 1 | 2 | 3 |
| 32ch 10sec | 32 | 2 | 1 | 1 | 1 | 2 | 3 |
| 128ch 1ms | 128 | 8 | 4 | 2 | 2 | 1 | 1 |
| 128ch 10ms | 128 | 8 | 4 | 2 | 2 | 1 | 1 |
| 128ch 50ms | 128 | 8 | 4 | 2 | 2 | 1 | 1 |
| 128ch 1sec | 128 | 8 | 4 | 2 | 2 | 2 | 3 |

From this it seems like 64, 96, or 192 are all good choices. The middle one is 96, but I'd be happy with 64 as well.

One note: these are 10 MB chunks from the perspective of the uncompressed data; the compressed chunks will be smaller. In neurosift on my Wi-Fi laptop, each chunk takes around 5 seconds to load, so I wouldn't want to go higher than that.

Here's the code to generate the table above

import math

ff = 30000  # sampling frequency (Hz)

# number of channels per chunk (the table columns)
ncs = [1, 16, 32, 64, 96, 192, 384]

# loading patterns: (label, number of channels requested, number of frames requested)
rows = [
    ('1ch 1tp', 1, 1),
    ('1ch 50ms', 1, 0.1*ff),
    ('1ch 1sec', 1, 1*ff),
    ('1ch 10sec', 1, 1*ff),
    ('32ch 1ms', 32, 0.001*ff),
    ('32ch 50ms', 32, 0.05*ff),
    ('32ch 1sec', 32, 1*ff),
    ('32ch 10sec', 32, 1*ff),
    ('128ch 1ms', 128, 0.001*ff),
    ('128ch 10ms', 128, 0.01*ff),
    ('128ch 50ms', 128, 0.05*ff),
    ('128ch 1sec', 128, 1*ff),
]

print('| | ' + ' | '.join([str(x) for x in ncs]) + ' |')
print('| --- |' + ' | '.join(['---' for x in ncs]) + ' |')
for r in rows:
    a = []
    for nc in ncs:
        # frames per ~10 MB chunk, assuming 2 bytes per sample and nc channels per chunk
        nf = 10 * 1024 * 1024 / (nc * 2)
        # chunks touched = chunks needed along time * chunks needed along channels
        a.append(math.ceil(r[2] / nf) * math.ceil(r[1] / nc))
    print('| ' + r[0] + '|' + ' | '.join([str(x) for x in a]) + ' |')

CodyCBakerPhD commented 1 year ago

@magland I had to go into DANDI set 59 to fix some unrelated things, so while I was there I implemented all the practices suggested here

Try out any/all of the files here: https://dandiarchive.org/dandiset/000059/0.230907.2101/files?location=

The raw ones are not paginated as it inflated file size ~20%, pretty much undoing any gains from compression

The processed ones are paginated at 10 MiB

All chunk shapes for all data streams are around ~10 MiB, with the ElectricalSeries in particular chunked by 64 channels

magland commented 1 year ago

Thanks @CodyCBakerPhD

I took a look at a few of them, and they do seem pretty responsive.

CodyCBakerPhD commented 7 months ago

Can this be closed? These examples and greater discussion are making their way to https://github.com/NeurodataWithoutBorders/nwb_benchmarks

Also NeuroConv defaults have used 64 channel max up to 10 MB total size for a while now

magland commented 7 months ago

closing.