esi-neuroscience / oephys2nwb

Export Open Ephys binary data to NWB 2.0
BSD 3-Clause "New" or "Revised" License

Speeding up the writing #5

Closed: KatharineShapcott closed this issue 1 year ago

KatharineShapcott commented 2 years ago

I looked a bit into speeding up the writing (see https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/iterative_write.html#example-convert-large-binary-data-arrays), but unfortunately I didn't have much luck with the default DataChunkIterator.
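
For reference, what I tried is roughly the default pattern from that tutorial (a minimal sketch; raw_data, electrode_region and the sampling rate are placeholders for the actual objects):

    import numpy as np
    from hdmf.data_utils import DataChunkIterator
    from pynwb.ecephys import ElectricalSeries

    # raw_data: (n_samples, n_channels) int16 memmap of the Open Ephys .dat file (placeholder)
    data_iter = DataChunkIterator(data=raw_data, buffer_size=2**16)  # stream blocks of rows

    raw_series = ElectricalSeries(name="raw", data=data_iter,
                                  electrodes=electrode_region,  # placeholder DynamicTableRegion
                                  rate=30000.0)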

KatharineShapcott commented 2 years ago

I got a marginal improvement by writing my own chunking class (below), but it really is only marginal. @pantaray Any thoughts? I was wondering if chunking over time instead of over channels would give better results, because that's the order the data is written in.

import numpy as np

from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

class ChannelSplitIterator(AbstractDataChunkIterator):
    """Iterate over the data one channel at a time, each channel split into
    `num_splits` consecutive time blocks."""

    def __init__(self, data, num_splits):
        self.data, self.num_splits = data, num_splits
        self.shape = data.shape  # (n_samples, n_channels)
        self.num_chunks = self.shape[1] * self.num_splits  # channels x time splits
        self.num_values = self.shape[0] // self.num_splits  # samples per split

        self.__i_chan = 0
        self.__i_split = 0

    def __iter__(self):
        return self

    def __next__(self):

        # Roll over to the next channel once all time splits are done
        if self.__i_split == self.num_splits:
            self.__i_split = 0
            self.__i_chan += 1

        # Stop after the last channel has been written
        if self.__i_chan >= self.shape[1]:
            raise StopIteration

        # The last split absorbs the remainder of shape[0] // num_splits
        if self.__i_split == self.num_splits - 1:
            end = self.shape[0]
        else:
            end = self.num_values * (self.__i_split + 1)
        st = self.num_values * self.__i_split

        view = np.s_[st:end, self.__i_chan]

        self.__i_split += 1

        return DataChunk(data=self.data[view],
                         selection=view)

    next = __next__  # Python 2 compatibility

    def recommended_chunk_shape(self):
        # Here we can optionally recommend what a good chunking should be.
        return None # automatic

    def recommended_data_shape(self):
        # We know the full size of the array. In cases where we don't know the full size
        # this should be the minimum size.
        return self.shape

    @property
    def dtype(self):
        # Open Ephys continuous data is stored as int16
        return np.dtype('int16')

    @property
    def maxshape(self):
        # We know the full shape of the array. If we don't know the size of a dimension
        # beforehand we can set the dimension to None instead
        return self.shape
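
For completeness, the iterator plugs in roughly like this (sketch; raw_data and electrode_region are placeholders as above):

    from pynwb.ecephys import ElectricalSeries

    # raw_data: (n_samples, n_channels) int16 memmap (placeholder)
    data_iter = ChannelSplitIterator(raw_data, num_splits=10)

    raw_series = ElectricalSeries(name="raw", data=data_iter,
                                  electrodes=electrode_region,  # placeholder
                                  rate=30000.0)
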
pantaray commented 2 years ago

Hi! Sorry for the slow reply. Wow, the iterator looks really cool! Thanks for all the work @KatharineShapcott! And yes, I think chunking over time might notably improve things for our case: the .dat files are channel-interleaved, so a block of consecutive samples across all channels is contiguous on disk.
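
To make that concrete, a time-major variant of your iterator could look roughly like this (untested sketch; it yields one block of samples spanning all channels per chunk, so each chunk maps to one contiguous stretch of the file):

    import numpy as np
    from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

    class TimeSplitIterator(AbstractDataChunkIterator):
        """Yield num_splits consecutive blocks of samples, each spanning all channels."""

        def __init__(self, data, num_splits):
            self.data, self.num_splits = data, num_splits
            self.shape = data.shape  # (n_samples, n_channels)
            self.num_values = self.shape[0] // self.num_splits  # samples per block
            self.__i_split = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self.__i_split >= self.num_splits:
                raise StopIteration
            st = self.num_values * self.__i_split
            # the last block absorbs the division remainder
            if self.__i_split == self.num_splits - 1:
                end = self.shape[0]
            else:
                end = self.num_values * (self.__i_split + 1)
            self.__i_split += 1
            view = np.s_[st:end, :]
            return DataChunk(data=self.data[view], selection=view)

        def recommended_chunk_shape(self):
            return None  # let HDF5 pick

        def recommended_data_shape(self):
            return self.shape

        @property
        def dtype(self):
            return np.dtype('int16')

        @property
        def maxshape(self):
            return self.shape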