franckalbinet / marisco

Encoding IAEA MARIS data as NetCDF and others.
https://fr.anckalbi.net/marisco/
Apache License 2.0

Discuss optimal placement and implementation for removing data #24

Open niallmurphy93 opened 1 week ago

niallmurphy93 commented 1 week ago

I've implemented a RemoveFilteredRowsCB class in our _ospar.ipynb handler to manage data removal in our pipeline. This generic approach allows for flexible filtering across different data types. However, we need to discuss whether this strategy is optimal or if we should consider more targeted filtering within individual callbacks.
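
For comparison, "more targeted filtering within individual callbacks" could look roughly like the sketch below, where a callback drops the rows it cannot handle as part of its own work. This is only an illustration: `DropMissingValueCB`, `FakeTransformer`, the `seawater` dataframe name and the `value` column are hypothetical names, not existing marisco code.

```python
import pandas as pd


class FakeTransformer:
    """Hypothetical stand-in for the Transformer: it only holds a dict of dataframes."""

    def __init__(self, dfs: dict):
        self.dfs = dfs


class DropMissingValueCB:
    """Hypothetical targeted callback: drops rows whose `col` is missing."""

    def __init__(self, col: str = "value"):
        self.col = col

    def __call__(self, tfm):
        for name, df in tfm.dfs.items():
            # Only touch dataframes that actually have the column.
            if self.col in df.columns:
                tfm.dfs[name] = df.dropna(subset=[self.col])


tfm = FakeTransformer({"seawater": pd.DataFrame({"value": [1.2, None, 3.4]})})
DropMissingValueCB()(tfm)
print(len(tfm.dfs["seawater"]))  # 2
```

The trade-off is that the removal rule lives next to the logic that needs it, but every such callback has to repeat its own filtering and logging.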

niallmurphy93 commented 1 week ago

The RemoveFilteredRowsCB class, as implemented in _ospar.ipynb:

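from typing import Callable

import fastcore.all as fc
import pandas as pd

# Callback and Transformer are marisco classes, assumed already imported in the notebook.
class RemoveFilteredRowsCB(Callback):
    """Remove rows from the dataframes in `tfm.dfs` that match a filter condition.

    `filters` maps a dataframe name to a callable returning a boolean mask (True = remove).
    """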

    def __init__(self, filters:dict, verbose:bool=False):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for df_name, filter_condition in self.filters.items():
            self._process_dataframe(tfm, df_name, filter_condition)

    def _process_dataframe(self, tfm: 'Transformer', df_name: str, filter_condition: Callable):
        if df_name in tfm.dfs:
            df = tfm.dfs[df_name]
            initial_rows = len(df)
            df = self._apply_filter(df, filter_condition)
            removed_rows = initial_rows - len(df)
            self._log_removal(df_name, removed_rows)
            tfm.dfs[df_name] = df
        else:
            self._log_missing_dataframe(df_name)

    def _apply_filter(self, df: pd.DataFrame, filter_condition: Callable) -> pd.DataFrame:
        mask = filter_condition(df)
        return df[~mask]  # Keep rows that don't match the filter

    def _log_removal(self, df_name: str, removed_rows: int):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Removed {removed_rows} rows from '{df_name}'.")

    def _log_missing_dataframe(self, df_name: str):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Dataframe '{df_name}' not found in tfm.dfs.")
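
To make the shape of the `filters` argument concrete, here is a minimal sketch that applies the same core logic (`df[~mask]` per named dataframe) to plain pandas objects, without the Transformer. The dataframe names, columns and filter conditions are illustrative assumptions, not the actual OSPAR rules.

```python
import pandas as pd

# Illustrative data; dataframe names and columns are assumptions.
dfs = {
    "seawater": pd.DataFrame(
        {"value": [1.2, None, 3.4], "unit": ["Bq/l", "Bq/l", None]}
    ),
    "biota": pd.DataFrame(
        {"value": [0.5, 0.7], "species": ["unknown", "Gadus morhua"]}
    ),
}

# One callable per dataframe name. Each returns a boolean mask
# in which True means "remove this row".
filters = {
    "seawater": lambda df: df["value"].isna() | df["unit"].isna(),
    "biota": lambda df: df["species"] == "unknown",
    "sediment": lambda df: df["value"].isna(),  # absent from dfs, so it is skipped
}

# Same core logic as _process_dataframe and _apply_filter.
for name, condition in filters.items():
    if name in dfs:
        mask = condition(dfs[name])
        dfs[name] = dfs[name][~mask]

remaining = {name: len(df) for name, df in dfs.items()}
print(remaining)  # {'seawater': 1, 'biota': 1}
```

With the class above, the same dict would be passed as `RemoveFilteredRowsCB(filters, verbose=True)`. Because a mask value of True means "remove", each condition should describe the rows to drop, not the rows to keep.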

If we adopt this approach for removing data, we could move the class to utils.ipynb.