MIT-LCP / wfdb-python

Native Python WFDB package
MIT License

Memory Exception When Merging Large Volumes of Waveform Data Files Using wfdb.wrsamp() #464

Open DishanH opened 1 year ago

DishanH commented 1 year ago

I'm trying to merge multiple waveform data (.dat) files into a single file. I'm using the wfdb.wrsamp() function for this task. The total number of files is approximately 10,000, and each one has 3 channels. I've tried several times, but every attempt results in a memory exception, requiring more than 40GB of memory. I'm unsure whether I'm doing something incorrect.

I've been unable to find a method to write the files incrementally. My current approach is to read each sample, combine all signals into an array, and write them. While this works fine with a small number of files, I'm having difficulties when it comes to larger datasets. Each file contains over 6 minutes of data.
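For concreteness, a minimal sketch of that approach (the record names are hypothetical, and it assumes all records share the same fs, units, and sig_name). Since wfdb.rdsamp returns the full signal array for each record, everything ends up in memory at once before wfdb.wrsamp writes it out:

import numpy as np
import wfdb

# Hypothetical record names; each record holds 3 channels of 6+ minutes.
record_names = [f"record_{k:05d}" for k in range(10000)]

# Read every record and keep all signals in memory.
signals = []
for name in record_names:
    sig, fields = wfdb.rdsamp(name)
    signals.append(sig)

# Concatenate along the time axis and write a single merged record.
# This peak (all signals plus the merged copy, plus the byte buffers
# built inside wrsamp) is what exhausts memory.
merged = np.concatenate(signals, axis=0)
wfdb.wrsamp(
    "merged_record",
    fs=fields["fs"],
    units=fields["units"],
    sig_name=fields["sig_name"],
    p_signal=merged,
    fmt=["16"] * merged.shape[1],
)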

Any assistance, insights, or suggestions on this matter would be highly appreciated.

DishanH commented 1 year ago

I have modified the library to process the data in chunks instead of concatenating everything at once, reducing memory usage to roughly a third of the original.

chunk_size = 1_000_000
n_chunks = int(len(d_signal) / chunk_size)
b_write = np.zeros((0,), dtype=np.uint8)
for p, i in enumerate(range(0, len(d_signal), chunk_size)):
    print(f"{p} of {n_chunks}")
    chunk = d_signal[i:i + chunk_size]

    # Split samples into separate bytes using binary masks
    b1 = chunk & [255] * tsamps_per_frame
    b2 = (chunk & [65280] * tsamps_per_frame) >> 8

    # Interweave the bytes so that the same sample's bytes are consecutive
    b1 = b1.reshape((-1, 1))
    b2 = b2.reshape((-1, 1))
    chunk_bytes = np.concatenate((b1, b2), axis=1)
    chunk_bytes = chunk_bytes.reshape((1, -1))[0]

    # Convert to unsigned 8-bit dtype to write
    chunk_bytes = chunk_bytes.astype("uint8")
    b_write = np.concatenate((b_write, chunk_bytes))
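One further refinement, not part of the patch above: b_write still grows to the full record size through repeated concatenation, which also copies the array on every iteration. If the surrounding write logic allows it, each chunk could instead be streamed straight to the output file, keeping peak memory at a single chunk:

# Hypothetical variant: 'f' stands for an open binary file handle for
# the .dat file; write each chunk as it is produced instead of
# growing b_write in memory.
chunk_bytes.tofile(f)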
bemoody commented 1 year ago

Thanks! Just to be clear, I assume you're talking about the function wr_dat_file, and that your code would replace the code at lines 2381 to 2392 (following elif fmt == "16").

The existing code looks to me like it's a lot more complicated than it needs to be. I'm sure that your replacement code is more efficient, but I also suspect that the entire thing could be replaced with just one or two numpy function calls - there's no need to make so many copies of the data.

Compare this with how format 80 is handled (see the code under if fmt == "80"). Format 16 could probably be handled in a very similar way - we don't need to add an offset in that case, but we do need to convert to little-endian 16-bit integers and then reinterpret as an array of bytes.
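For illustration, a minimal sketch of that simplification, reusing the variable names from the snippet above (this is not the library's current code). Casting to a signed little-endian 16-bit dtype also takes care of the two's complement conversion, so the explicit masking, shifting, and interleaving all disappear:

# Sketch only: convert the digital signal to little-endian 16-bit
# integers, then reinterpret the same buffer as raw bytes to write.
b_write = d_signal.astype("<i2").view("uint8").reshape(-1)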

Please consider opening a pull request with your changes.