MITHaystack / digital_rf

Read, write, and interact with data in the Digital RF and Digital Metadata formats

possibly fixed a bug with caching. added [()] to ensure checksum and … #51

Closed jvierine closed 1 year ago

jvierine commented 1 year ago

…compressed data is faster

ryanvolz commented 1 year ago

I'll have to think a little bit about the possible implications of this change in behavior (reading the whole dataset into memory for each newly-accessed file) to see if there are any downsides for typical usage. I think that with default settings this is strictly an improvement, since the dataset is a single chunk and would be read in its entirety by h5py anyway.
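For context, here is a minimal sketch of the two access patterns being discussed; the file name is illustrative, and `rf_data` is the dataset name used by Digital RF. With `[()]`, h5py reads (and decompresses) the entire dataset once into a NumPy array, whereas plain slicing goes back to HDF5 on every read.

    import h5py

    # Illustrative file name; any Digital RF HDF5 data file applies.
    with h5py.File("rf@1595424000.000.h5", "r") as f:
        dset = f["rf_data"]

        # Lazy access: each slice goes back to HDF5, so a compressed,
        # checksummed single-chunk dataset may be decompressed per read.
        block_a = dset[0:10000]
        block_b = dset[10000:20000]

        # Eager access with [()]: the whole dataset is read (and
        # decompressed) once into a NumPy array; later slicing is
        # in-memory only.
        data = dset[()]
        block_a = data[0:10000]
        block_b = data[10000:20000]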

It may also be worth investigating caching that is internal to the HDF5 library, i.e. setting rdcc_nbytes to a larger value when opening the file with h5py. To be determined if that method would have similar performance.

I'll also note that this is something commonly worked around manually: users manage a data cache themselves by requesting reads of large (whole-file) chunks and then breaking them up for processing. I've never liked that the library doesn't make that easier, and for a long time I've wanted a generator-style API for reading data sequentially in chunks of a specified size, which would improve performance by caching and reading ahead.
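A rough sketch of what such a generator could look like, built on top of the existing DigitalRFReader.read_vector call; the function name, buffering policy, directory, and channel name below are assumptions for illustration, not library API:

    import digital_rf

    def read_chunks(reader, channel, start_sample, end_sample,
                    chunk_size, buffer_size=1000000):
        """Yield chunk_size-sample arrays while reading ahead in large
        blocks (buffer_size should be a multiple of chunk_size)."""
        sample = start_sample
        while sample < end_sample:
            # One large read per buffer: a compressed, checksummed file
            # is decompressed once here rather than once per small chunk.
            nread = min(buffer_size, end_sample - sample)
            block = reader.read_vector(sample, nread, channel)
            for i in range(0, nread, chunk_size):
                yield block[i:i + chunk_size]
            sample += nread

    # Usage sketch (directory and channel name are placeholders; gaps in
    # the data are not handled here).
    reader = digital_rf.DigitalRFReader("/data/example")
    start, end = reader.get_bounds("ch0")
    for chunk in read_chunks(reader, "ch0", start, end, chunk_size=10000):
        pass  # process each chunk here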

jvierine commented 1 year ago

> I'll have to think a little bit about the possible implications of this change in behavior (reading the whole dataset into memory for each newly-accessed file) to see if there are any downsides for typical usage. I think that with default settings this is strictly an improvement, since the dataset is a single chunk and would be read in its entirety by h5py anyway.

The proposed change might introduce a performance penalty for files that do not use compression or checksumming. I should probably test this. Maybe this could be the behavior for compressed files only?

I need to achieve the large speed-up for compressed and checksummed files somehow. It is painful to have signal processing code decompress a 1,000,000-sample vector 100 times when it only needs to be done once.

I considered writing a wrapper on top of digital rf as you suggested, but this would have required duplicating much of the index calculation already done inside the library.

Could you consider making this a user option, something like enable_cache_for_compressed_files=True/False? That way the user could optionally request the faster read method.
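For illustration only, the proposed flag (which does not exist in the current API) might be passed when constructing the reader:

    import digital_rf

    # Hypothetical keyword argument, not part of the current DigitalRFReader API.
    reader = digital_rf.DigitalRFReader(
        "/data/example", enable_cache_for_compressed_files=True
    )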

> It may also be worth investigating caching that is internal to the HDF5 library, i.e. setting rdcc_nbytes to a larger value when opening the file with h5py. To be determined if that method would have similar performance.

From what I understand, this method won't help with files that are already compressed with one chunk per file. They would still require 100 decompression calls per file instead of the single call that is theoretically needed in my typical application.

jvierine commented 1 year ago

> It may also be worth investigating caching that is internal to the HDF5 library, i.e. setting rdcc_nbytes to a larger value when opening the file with h5py. To be determined if that method would have similar performance.

I was wrong. I added rdcc_nbytes=4000000, which is the size of the raw voltage data in one file, and reads were fast without the copy-to-RAM operation ([()]).

                # Open the file with a 4 MB HDF5 chunk cache so the single
                # compressed, checksummed chunk is only decompressed once.
                self._cachedFile = h5py.File(fullfile, "r", rdcc_nbytes=4000000)
                self._cachedFilename = fullfile

                # Leave rf_data as a lazy h5py dataset; the chunk cache
                # handles repeated reads.
                self.rf_data = self._cachedFile["rf_data"]
                self.rf_data_len = self.rf_data.shape[0]

                # The small rf_data_index dataset is read fully into memory.
                self.rf_index = self._cachedFile["rf_data_index"][...]
                self.rf_index_len = self.rf_index.shape[0]

Since this does solve the issue, I agree with you that it is the cleaner way to solve it, as it uses the caching mechanism already built into HDF5.

But how do we ensure that a sane value of rdcc_nbytes is used by default? It cannot simply default to the file size, as that can be larger than the available RAM. What about setting it to n.min(4000000, file_size_nbytes)?
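A sketch of that default, assuming the file size is obtained with os.path.getsize (using Python's built-in min in place of n.min; the path is a placeholder):

    import os

    import h5py

    fullfile = "/data/example/ch0/rf@1595424000.000.h5"  # placeholder path

    # Cap the HDF5 chunk cache at 4 MB, or the file size if smaller, so the
    # default never scales with arbitrarily large files.
    rdcc_nbytes = min(4000000, os.path.getsize(fullfile))
    cached_file = h5py.File(fullfile, "r", rdcc_nbytes=rdcc_nbytes)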