I have a problem loading a large MDF (MF4) file into a DataFrame using the iter_to_dataframe() method.
To tackle this, we already switched from to_dataframe() to iter_to_dataframe(), which works fine for smaller files as before, but the process gets killed for larger files (>~20 GB).
We also tried altering the raster, chunk_ram_size, and reduce_memory_usage parameters to avoid memory issues, but the problem persists.
Do you know of any workaround, a way to debug this, or a solution to this problem?
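The only visibility we have so far is watching resident memory grow per chunk. A minimal psutil-based sketch of that kind of logging (the file name is illustrative, not from our actual code):

import os
import psutil
from asammdf import MDF

process = psutil.Process(os.getpid())

with MDF("recording.mf4") as mdf:  # illustrative file name
    for i, df in enumerate(mdf.iter_to_dataframe(chunk_ram_size=209715200)):
        # print RSS after each chunk to see where memory climbs before the kill
        rss_mib = process.memory_info().rss / 1024**2
        print(f"chunk {i}: {len(df)} rows, RSS {rss_mib:.0f} MiB")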
Quick explanation of the workflow we are using:
we load an MF4 file into a DataFrame, do some processing and filtering, and write the result to Parquet at the end for further use.
snippet:
import pandas as pd
from asammdf import MDF

def _apply_dataframe_processing(self, mdf: MDF, signals_renaming_mapping: dict[str, str]) -> pd.DataFrame:
    """Converts mdf to dataframe, adjusts time column, renames signals, and drops duplicates after renaming"""
    df_list = []
    for df in mdf.iter_to_dataframe(
        time_from_zero=False,
        raster=1 / 10**self.precision,
        raw=True,
        reduce_memory_usage=True,
        chunk_ram_size=209715200,  # 200 MiB per chunk
    ):
        if df.empty:
            continue
        # move the timestamp index into a regular column and round it
        df.reset_index(inplace=True, names="time")
        df["time"] = df["time"].round(self.precision)
        df = df.rename(columns=signals_renaming_mapping)
        # keep only the first occurrence of columns that collide after renaming
        df = df.loc[:, ~df.columns.duplicated(keep="first")]
        df_list.append(df)
    # also tried using pickle and dask to store the chunks on disk instead of in
    # memory, but the process still gets killed inside iter_to_dataframe()
    return pd.concat(df_list, ignore_index=True)
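For completeness, a minimal sketch of the write-each-chunk-to-disk idea (shown here with pyarrow's ParquetWriter; we actually tried pickle and dask, and the file name, parameters, and schema handling below are illustrative). The failure mode is the same: the process is killed inside iter_to_dataframe() before the chunks ever reach the writer.

import pyarrow as pa
import pyarrow.parquet as pq
from asammdf import MDF

with MDF("recording.mf4") as mdf:  # illustrative file name
    writer = None
    for df in mdf.iter_to_dataframe(time_from_zero=False, reduce_memory_usage=True):
        if df.empty:
            continue
        table = pa.Table.from_pandas(df, preserve_index=False)
        if writer is None:
            # take the schema from the first chunk; assumes later chunks match it
            writer = pq.ParquetWriter("output.parquet", table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()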