danielhrisca / asammdf

Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files
GNU Lesser General Public License v3.0

Problem with conversion of large MDF Files #1021

Open xoxStudios opened 1 month ago

xoxStudios commented 1 month ago

I have a problem loading a large MDF (mf4) file into a dataframe using the iter_to_dataframe() method. To tackle this, we already switched from the to_dataframe() method to iter_to_dataframe(), which works fine for smaller files as before, but gets killed for larger files (>~20 GB). We also tried altering the raster, chunk_ram_size and reduce_memory_usage parameters to avoid memory issues, but the problem persists. Do you know any workaround to debug this, or do you have a solution to this problem?

Quick explanation of the workflow we are using: we load an mf4 file into a dataframe, do some processing and filtering, and finally write it to parquet for further use.

snippet:

def _apply_dataframe_processing(self, mdf: MDF, signals_renaming_mapping: dict[str, str]) -> pd.DataFrame:
    """Converts mdf to dataframe, adjusts time column, renames signals, and drops duplicates after renaming"""
    df_list = []
    for df in mdf.iter_to_dataframe(time_from_zero=False, raster=1/10**self.precision, raw=True, reduce_memory_usage=True, chunk_ram_size=209715200):
        if df.empty:
            continue
        df.reset_index(inplace=True, names="time")
        df["time"] = df["time"].round(self.precision)
        df = df.rename(columns=signals_renaming_mapping)
        columns_to_keep = list(~df.columns.duplicated(keep="first"))
        df = df.loc[:, columns_to_keep]
        df_list.append(df)
        # also tried using pickle and dask to store the iterable on disk instead of in memory, but the process still gets killed inside the iter_to_dataframe() method
    # concatenate the collected chunks into a single dataframe (the original snippet returned only the last chunk)
    return pd.concat(df_list)
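A minimal sketch of one possible workaround that flushes each chunk to a single parquet file with pyarrow instead of collecting chunks in df_list; the file paths, the dropped raster argument, and the assumption that every non-empty chunk shares the schema of the first one are illustrative, not taken from the issue:

import pyarrow as pa
import pyarrow.parquet as pq

from asammdf import MDF

writer = None
with MDF("large_file.mf4") as mdf:
    for df in mdf.iter_to_dataframe(time_from_zero=False, raw=True, reduce_memory_usage=True, chunk_ram_size=209715200):
        if df.empty:
            continue
        df.reset_index(inplace=True, names="time")
        table = pa.Table.from_pandas(df, preserve_index=False)
        if writer is None:
            # the first non-empty chunk fixes the schema for the whole output file
            writer = pq.ParquetWriter("out.parquet", table.schema)
        writer.write_table(table)
if writer is not None:
    writer.close()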
danielhrisca commented 1 month ago

Any chance you could send the file?

alex-ruehe commented 5 days ago

@xoxStudios We had similar issues. In the end we moved to a two-step process: write one parquet file per chunk and merge them afterwards (using pyarrow).
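For reference, a minimal sketch of that two-step process, assuming pyarrow is installed; the paths, the chunk_ram_size value, and the assumption that all chunk files end up with the same schema are illustrative, not taken from this thread:

from pathlib import Path

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

from asammdf import MDF

chunk_dir = Path("chunks")
chunk_dir.mkdir(exist_ok=True)

# step 1: write one parquet file per chunk, so only one chunk is held in memory at a time
with MDF("large_file.mf4") as mdf:
    for i, df in enumerate(mdf.iter_to_dataframe(chunk_ram_size=209715200)):
        if df.empty:
            continue
        df.to_parquet(chunk_dir / f"chunk_{i:05d}.parquet")

# step 2: merge the chunk files into a single parquet file, streaming batch by batch
dataset = ds.dataset(chunk_dir, format="parquet")
with pq.ParquetWriter("merged.parquet", dataset.schema) as writer:
    for batch in dataset.to_batches():
        writer.write_table(pa.Table.from_batches([batch], schema=dataset.schema))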