Writing my own chunking class got me an improvement, but it's really only marginal. @pantaray Any thoughts? I was wondering whether chunking over time instead of over channels would give better results, since that's the order the data is written in.
```python
import numpy as np
from hdmf.data_utils import AbstractDataChunkIterator, DataChunk


class ChannelSplitIterator(AbstractDataChunkIterator):
    """Yield one channel at a time, split into num_splits blocks of rows."""

    def __init__(self, data, num_splits):
        self.data, self.num_splits = data, num_splits
        self.shape = data.shape
        # termination sentinel: (1 + i_chan) * (1 + i_split) only reaches
        # this value after the last split of the last channel
        self.num_chunks = self.shape[1] * (1 + self.num_splits)
        self.num_values = self.shape[0] // self.num_splits
        self.__i_chan = 0
        self.__i_split = 0

    def __iter__(self):
        return self

    def __next__(self):
        if (1 + self.__i_chan) * (1 + self.__i_split) < self.num_chunks:
            if self.__i_split == self.num_splits:
                # finished the current channel, move on to the next one
                self.__i_split = 0
                self.__i_chan += 1
            if self.__i_split == self.num_splits - 1:
                # last split absorbs any remainder rows
                end = self.shape[0]
            else:
                end = self.num_values * (self.__i_split + 1)
            st = self.num_values * self.__i_split
            view = np.s_[st:end, self.__i_chan]
            self.__i_split += 1
            return DataChunk(data=self.data[view],
                             selection=view)
        else:
            raise StopIteration

    next = __next__  # Python 2 compatibility

    def recommended_chunk_shape(self):
        # Here we can optionally recommend what a good chunking should be.
        return None  # automatic

    def recommended_data_shape(self):
        # We know the full size of the array. In cases where we don't know
        # the full size, this should be the minimum size.
        return self.shape

    @property
    def dtype(self):
        # The data type of our continuous data (hardcoded to this
        # recording's int16 samples)
        return np.dtype('int16')

    @property
    def maxshape(self):
        # We know the full shape of the array. If we don't know the size of
        # a dimension beforehand, we can set that dimension to None instead.
        return self.shape
```
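
For comparison, a time-first variant might look something like the sketch below (untested; the name `TimeSplitIterator` is just a placeholder, and it assumes `data` exposes `.shape` and `.dtype`, e.g. a `np.memmap`). Each chunk covers all channels for a contiguous block of rows, so reads from the row-major raw file should be sequential rather than strided:

```python
import numpy as np
from hdmf.data_utils import AbstractDataChunkIterator, DataChunk


class TimeSplitIterator(AbstractDataChunkIterator):
    """Yield num_splits blocks of consecutive rows, each covering all channels."""

    def __init__(self, data, num_splits):
        self.data, self.num_splits = data, num_splits
        self.shape = data.shape
        self.num_values = self.shape[0] // self.num_splits
        self.__i_split = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.__i_split >= self.num_splits:
            raise StopIteration
        st = self.num_values * self.__i_split
        if self.__i_split == self.num_splits - 1:
            end = self.shape[0]  # last block absorbs any remainder rows
        else:
            end = self.num_values * (self.__i_split + 1)
        self.__i_split += 1
        view = np.s_[st:end, :]
        return DataChunk(data=self.data[view], selection=view)

    next = __next__

    def recommended_chunk_shape(self):
        return None  # let the backend choose

    def recommended_data_shape(self):
        return self.shape

    @property
    def dtype(self):
        return np.dtype(self.data.dtype)

    @property
    def maxshape(self):
        return self.shape
```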
Hi! Sorry, I'm pretty slow. Wow, the iterator looks really cool! Thanks for all the work @KatharineShapcott! And yes, I think chunking over time might notably improve things for our case due to the on-disk layout of the data.
I looked a bit into speeding up the writing (https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/iterative_write.html#example-convert-large-binary-data-arrays), but unfortunately I didn't have much luck with the default DataChunkIterator.
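
For reference, the tutorial-style approach wraps the raw array roughly like this (a sketch; the file name, shape, and buffer_size are placeholders):

```python
import numpy as np
from hdmf.data_utils import DataChunkIterator
from pynwb import H5DataIO

# memory-mapped raw recording, samples x channels (placeholder path/shape)
raw = np.memmap('raw.dat', dtype='int16', mode='r', shape=(10_000_000, 128))

# DataChunkIterator steps along axis 0 (time), emitting buffer_size rows per
# chunk; the wrapped object is then passed as the `data` argument of,
# e.g., an ElectricalSeries
data = H5DataIO(DataChunkIterator(data=raw, buffer_size=65536),
                compression='gzip')
```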