Open niallmurphy93 opened 1 week ago
`RemoveFilteredRowsCB` class in our `ospar.ipynb`:
```python
import fastcore.all as fc  # provides store_attr
import pandas as pd
from typing import Callable

# `Callback` and `Transformer` come from the pipeline's base module.

class RemoveFilteredRowsCB(Callback):
    """Remove rows from a dataframe based on a filter condition."""
    def __init__(self, filters: dict, verbose: bool = False):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for df_name, filter_condition in self.filters.items():
            self._process_dataframe(tfm, df_name, filter_condition)

    def _process_dataframe(self, tfm: 'Transformer', df_name: str, filter_condition: Callable):
        if df_name in tfm.dfs:
            df = tfm.dfs[df_name]
            initial_rows = len(df)
            df = self._apply_filter(df, filter_condition)
            removed_rows = initial_rows - len(df)
            self._log_removal(df_name, removed_rows)
            tfm.dfs[df_name] = df
        else:
            self._log_missing_dataframe(df_name)

    def _apply_filter(self, df: pd.DataFrame, filter_condition: Callable) -> pd.DataFrame:
        mask = filter_condition(df)
        return df[~mask]  # Keep rows that don't match the filter

    def _log_removal(self, df_name: str, removed_rows: int):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Removed {removed_rows} rows from '{df_name}'.")

    def _log_missing_dataframe(self, df_name: str):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Dataframe '{df_name}' not found in tfm.dfs.")
```
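To make the intended usage concrete, here is a minimal, runnable sketch. The `Callback`/`Transformer` stubs, the condensed callback body, and the `'biota'` dataframe with its `value` column are all stand-ins for illustration, not the actual pipeline objects:

```python
import pandas as pd

# Stand-ins so this sketch runs on its own; in the pipeline these
# come from the codebase's base classes (names assumed here).
class Callback: pass

class Transformer:
    def __init__(self, dfs): self.dfs = dfs

class RemoveFilteredRowsCB(Callback):
    """Condensed version of the class above."""
    def __init__(self, filters, verbose=False):
        self.filters, self.verbose = filters, verbose
    def __call__(self, tfm):
        for name, cond in self.filters.items():
            if name in tfm.dfs:
                df = tfm.dfs[name]
                tfm.dfs[name] = df[~cond(df)]  # keep rows NOT matching

# Hypothetical example: drop negative measurements from 'biota'.
tfm = Transformer({'biota': pd.DataFrame({'value': [1.0, -99.0, 2.5]})})
RemoveFilteredRowsCB({'biota': lambda df: df['value'] < 0})(tfm)
print(tfm.dfs['biota']['value'].tolist())  # → [1.0, 2.5]
```

The key convention is that each filter returns a boolean mask of rows to *remove*, and the callback keeps the complement.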
If we choose to adopt this approach for removing data, we could move it to `utils.ipynb`.
I've implemented a `RemoveFilteredRowsCB` class in our `ospar.ipynb` handler to manage data removal in our pipeline. This generic approach allows flexible filtering across different data types. However, we need to discuss whether this strategy is optimal, or whether more targeted filtering within individual callbacks would serve us better.
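For comparison, the "targeted" alternative would mean one dedicated callback per removal rule, something like the sketch below. The callback name, the `'biota'` dataframe, and the `value` column are hypothetical; the `Callback`/`Transformer` stubs stand in for the real base classes:

```python
import pandas as pd

class Callback: pass  # stand-in for the pipeline's base class

class Transformer:  # minimal stand-in
    def __init__(self, dfs): self.dfs = dfs

class RemoveNegativeValuesCB(Callback):
    """Hypothetical targeted callback: drops rows with negative 'value'
    from the 'biota' dataframe only."""
    def __call__(self, tfm):
        df = tfm.dfs['biota']
        tfm.dfs['biota'] = df[df['value'] >= 0]

tfm = Transformer({'biota': pd.DataFrame({'value': [0.5, -1.0]})})
RemoveNegativeValuesCB()(tfm)
print(tfm.dfs['biota']['value'].tolist())  # → [0.5]
```

The trade-off is readability and per-rule logging versus the single configurable entry point that `RemoveFilteredRowsCB` offers.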