Dear @danielhrisca, please find my problem described below.

Python version

('python=3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]')
'os=Windows-10-10.0.19045-SP0'
'numpy=1.25.2'
'asammdf=7.3.14'

Code

MDF version

4.10

Code snippet

import re

from asammdf import MDF

def getFullChannelList(mdfFile):
    # Build, for each group, a list of (channel name, lazy sample iterator) pairs
    infoPack = mdfFile.info()
    channelTab = []
    for i in range(infoPack['groups']):
        group_key = 'group {}'.format(i)
        groupChannelsTab = []
        for j in range(infoPack[group_key]['channels count']):
            channel_key = 'channel {}'.format(j)
            name_type = infoPack[group_key][channel_key]
            m = re.search(r"name=\"(.*?)\" type=(VALUE|MASTER)", name_type)
            if m is None:
                # skip entries whose description does not match the expected pattern
                continue
            channel_name = m.group(1)
            groupChannelsTab.append((channel_name, mdfFile.iter_get(None,i,j,None,True)))
        channelTab.append(groupChannelsTab)
    return channelTab

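def memory_limited_eval(path: str):
    mdf = MDF(path, "minimal")
    data_records = getFullChannelList(mdf)
    stats = {}

    for group in data_records:
        for name, channel in group:
            # each element is one chunk of the channel; element[0] holds the sample values
            for element in channel:
                vals = element[0]
                # aggregate_statistics updates the per-signal min, max, sum, size fields
                aggregate_statistics(name, vals, stats)
    return stats

path = "path/to/my/file.mdf4"
stats = memory_limited_eval(path)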

Description

I'd like to calculate basic statistics (min, max, mean) on signals extracted from MDF4 files. My MDF4 files are quite large, 10-20 GB, while my computer has only 8 GB of memory. As you can see in the code snippet, my plan was to iterate through the whole file once, collect all the information, and aggregate it into a dictionary.

According to my tests with smaller files (~3 GB), loading the complete file into memory, iterating over the signals from mdf.iter_channels(), and then taking signal.samples.min()/max()/mean() needs 70% less execution time than the code proposed above. I know that calculation on preloaded arrays can be more efficient, but this factor seems unrealistic to me. The aggregation function creates a data record for each signal only once, so I think the memory allocation overhead can be ignored. The min, max, sum, and size fields it contains are updated whenever a corresponding chunk is loaded. So I assume the bottleneck is my implementation of iterating through the MDF4 file.
As a confirmation, simply iterating through the whole file as proposed above, without any aggregation, is much slower as well.
Is there any mechanism to insert a callback function, e.g. to aggregate information here, while using a more efficient way to iterate through the file?
Or is there any mechanism to iterate over all records from beginning to end, in the order in which they were serialized, reading the next N records into memory at each step and doing the aggregation on those chunks?
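
The aggregation described above (one record per signal, created once, with its min, max, sum, and size fields updated per chunk) could look roughly like the sketch below. It is an illustration only, not necessarily the exact function used in the measurements: the record layout and the finalize helper are assumptions. It uses Python builtins so that it runs on any sequence; on NumPy arrays, vals.min(), vals.max(), and vals.sum() would be the faster equivalents.

```python
def aggregate_statistics(name, vals, stats):
    """Fold one chunk of samples into the running record kept for `name`."""
    if len(vals) == 0:
        # nothing to aggregate for an empty chunk
        return
    chunk_min, chunk_max = min(vals), max(vals)
    chunk_sum, chunk_size = sum(vals), len(vals)

    record = stats.get(name)
    if record is None:
        # first chunk of this signal: create the record exactly once
        stats[name] = {"min": chunk_min, "max": chunk_max,
                       "sum": chunk_sum, "size": chunk_size}
    else:
        # later chunks: update the existing record in place
        record["min"] = min(record["min"], chunk_min)
        record["max"] = max(record["max"], chunk_max)
        record["sum"] += chunk_sum
        record["size"] += chunk_size


def finalize(stats):
    """Hypothetical helper: derive min, max, mean from the running records."""
    return {
        name: {"min": r["min"], "max": r["max"], "mean": r["sum"] / r["size"]}
        for name, r in stats.items()
    }


# Usage: two chunks of the same (made-up) signal
stats = {}
for chunk in ([1.0, 4.0], [2.0, -3.0]):
    aggregate_statistics("speed", chunk, stats)
summary = finalize(stats)
```

Keeping sum and size rather than a running mean means the mean is computed once at the end, so chunks of different lengths are weighted correctly.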

Sorry for the many questions!

Thank you for your help in advance!

danielhrisca / asammdf

Best practice for iterating through large MDF4 files and collecting statistics #890
