danielhrisca / asammdf

Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files
GNU Lesser General Public License v3.0

Best practice for iterating through large MDF4 files and collecting statistics #890

Closed kozmad closed 11 months ago

kozmad commented 1 year ago

Dear @danielhrisca, please find my problem described below.

Python version

('python=3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]')
'os=Windows-10-10.0.19045-SP0'
'numpy=1.25.2'
'asammdf=7.3.14'

Code

MDF version

4.10

Code snippet

import re

from asammdf import MDF


def getFullChannelList(mdfFile):
    # Build, per group, a list of (channel name, lazy sample iterator) pairs.
    infoPack = mdfFile.info()
    channelTab = []
    for i in range(infoPack['groups']):
        group_key = 'group {}'.format(i)
        groupChannelsTab = []
        for j in range(infoPack[group_key]['channels count']):
            channel_key = 'channel {}'.format(j)
            name_type = infoPack[group_key][channel_key]
            m = re.search(r"name=\"(.*?)\" type=(VALUE|MASTER)", name_type)
            if m is None:
                # Skip entries whose description does not match the pattern.
                continue
            channel_name = m.group(1)
            # samples_only=True: each yielded element is (samples, invalidation bits).
            groupChannelsTab.append(
                (channel_name, mdfFile.iter_get(group=i, index=j, samples_only=True))
            )
        channelTab.append(groupChannelsTab)
    return channelTab


def memory_limited_eval(path: str):
    mdf = MDF(path, "minimal")
    data_records = getFullChannelList(mdf)
    stats = {}

    for group in data_records:
        for name, channel in group:
            for element in channel:
                vals = element[0]
                # aggregate_statistics is a user-defined helper, not part of asammdf.
                aggregate_statistics(name, vals, stats)
    return stats


path = "path/to/my/file.mdf4"
memory_limited_eval(path)

Description

I'd like to calculate basic statistics (min, max, mean) on signals extracted from MDF4 files. My MDF4 files are quite large (10-20 GB), while my computer has only 8 GB of memory. As the code snippet shows, my plan was to iterate through the whole file once, collect all the information, and aggregate it into a dictionary.
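
The snippet calls aggregate_statistics without showing it. For reference, a minimal sketch of what such a helper could look like, assuming each chunk arrives as a NumPy-compatible array and that only integer and float channels are of interest. The layout of the stats dictionary and the finalize_statistics name are illustrative assumptions, not part of asammdf.

```python
import numpy as np


def aggregate_statistics(name, vals, stats):
    """Fold one chunk of samples into the running statistics for `name`."""
    vals = np.asarray(vals)
    # Skip empty chunks and non-numeric channels (strings, byte arrays, ...).
    if vals.size == 0 or vals.dtype.kind not in "iuf":
        return
    entry = stats.setdefault(
        name, {"min": np.inf, "max": -np.inf, "sum": 0.0, "count": 0}
    )
    # Only the running values are kept, so memory use does not grow with file size.
    entry["min"] = min(entry["min"], float(vals.min()))
    entry["max"] = max(entry["max"], float(vals.max()))
    entry["sum"] += float(vals.sum(dtype=np.float64))
    entry["count"] += int(vals.size)


def finalize_statistics(stats):
    """Turn the running sums into min/max/mean per channel (hypothetical helper)."""
    return {
        name: {
            "min": e["min"],
            "max": e["max"],
            "mean": e["sum"] / e["count"],
        }
        for name, e in stats.items()
    }
```

The mean is derived from a running sum and count rather than by averaging per-chunk means, so chunks of different lengths are weighted correctly. NaN samples, if present, would propagate into the results.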

Sorry for the many questions!

Thank you for your help in advance!

danielhrisca commented 11 months ago

If you need to compute min/max for each channel, then the fastest way is to use iter_channels.
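
A rough sketch of how that suggestion might be applied, not a definitive implementation. It assumes MDF.iter_channels() yields one Signal-like object per channel with .name and .samples attributes, and that any single channel fits in memory. The channel_min_max function is a hypothetical name introduced here for illustration.

```python
import numpy as np


def channel_min_max(signals):
    """Return {channel name: (min, max)} for an iterable of Signal-like objects."""
    result = {}
    for sig in signals:
        samples = np.asarray(sig.samples)
        # Skip empty and non-numeric channels.
        if samples.size == 0 or samples.dtype.kind not in "iuf":
            continue
        lo, hi = float(samples.min()), float(samples.max())
        if sig.name in result:
            # The same name can occur in several groups; merge the ranges.
            prev_lo, prev_hi = result[sig.name]
            lo, hi = min(lo, prev_lo), max(hi, prev_hi)
        result[sig.name] = (lo, hi)
    return result


# Possible usage with asammdf (assumes the package is installed):
#   from asammdf import MDF
#   mdf = MDF("path/to/my/file.mdf4")
#   print(channel_min_max(mdf.iter_channels()))
#   mdf.close()
```

Because only one channel is held at a time and just two numbers per channel are retained, peak memory is bounded by the largest single channel rather than by the file size.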