bbartling / open-fdd

Fault Detection Diagnostics (FDD) for HVAC datasets
MIT License

pandas.eval() for massive datasets? #20

Closed: bbartling closed this issue 1 month ago

bbartling commented 9 months ago

Current method with apply in pandas:

def apply(self, df: pd.DataFrame) -> pd.DataFrame:
    # Existing checks
    df['static_check_'] = (
        df[self.duct_static_col] < df[self.duct_static_setpoint_col] - self.duct_static_inches_err_thres)
    df['fan_check_'] = (
        df[self.supply_vfd_speed_col] >= self.vfd_speed_percent_max - self.vfd_speed_percent_err_thres)

    # Combined condition check
    df["combined_check"] = df['static_check_'] & df['fan_check_']

    # Rolling sum to count consecutive trues
    rolling_sum = df["combined_check"].rolling(window=5).sum()
    # Set flag to 1 if rolling sum equals the window size (5)
    df["fc1_flag"] = (rolling_sum == 5).astype(int)

    return df
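
For reference, the rolling-sum trick only raises fc1_flag once the combined condition has held for the full window. A tiny illustrative sketch with synthetic values (not project data):

import pandas as pd

# The flag goes to 1 only after five consecutive True samples.
combined = pd.Series([True, True, True, True, True, False, True, True])
rolling_sum = combined.rolling(window=5).sum()
flag = (rolling_sum == 5).astype(int)
print(flag.tolist())  # [0, 0, 0, 0, 1, 0, 0, 0]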

Use eval?

def apply_with_eval(self, df: pd.DataFrame) -> pd.DataFrame:
    # Use eval for the simple comparison operations.
    # The attributes like self.duct_static_col hold column *names*, not data,
    # so they are interpolated into the expression string; scalar thresholds
    # are passed with the @ local-variable syntax.
    df.eval(
        f"static_check_ = {self.duct_static_col} < "
        f"({self.duct_static_setpoint_col} - @self.duct_static_inches_err_thres)",
        inplace=True,
    )
    df.eval(
        f"fan_check_ = {self.supply_vfd_speed_col} >= "
        "(@self.vfd_speed_percent_max - @self.vfd_speed_percent_err_thres)",
        inplace=True,
    )

    # Combined condition check (bitwise AND)
    df["combined_check"] = df["static_check_"] & df["fan_check_"]

    # Rolling sum to count consecutive trues (this part remains the same)
    rolling_sum = df["combined_check"].rolling(window=5).sum()
    # Set flag to 1 if the rolling sum equals the window size (5)
    df["fc1_flag"] = (rolling_sum == 5).astype(int)

    return df
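
Taking eval a step further, the two comparisons and the AND could be fused into one expression so the whole boolean rule is evaluated in a single pass, which is where numexpr-backed eval tends to help most on large frames. This is only a sketch using the same attribute names as above, not something the library does today:

def apply_with_single_eval(self, df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical variant: one eval() call for the entire boolean rule.
    # Column names stored on self are interpolated into the expression and
    # assumed to be valid Python identifiers; scalar thresholds are passed
    # with the @ local-variable syntax supported by DataFrame.eval().
    expr = (
        f"combined_check = ({self.duct_static_col} < "
        f"({self.duct_static_setpoint_col} - @self.duct_static_inches_err_thres)) "
        f"& ({self.supply_vfd_speed_col} >= "
        f"(@self.vfd_speed_percent_max - @self.vfd_speed_percent_err_thres))"
    )
    df.eval(expr, inplace=True)

    # The rolling window logic stays in plain pandas; eval() has no rolling support.
    rolling_sum = df["combined_check"].rolling(window=5).sum()
    df["fc1_flag"] = (rolling_sum == 5).astype(int)
    return df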

Any insights appreciated. It's sort of interesting to see what ChatGPT says in a conversation about this:

  • Continue using the current approach with standard pandas operations, especially for the more complex parts like the rolling window operation.
  • If performance becomes an issue, consider using eval() for the simpler comparison operations, but benchmark on your specific dataset to confirm it is actually faster (see the sketch after this list).
  • Balance readability and maintainability against performance, and choose whichever best fits the project's requirements; eval() can help in certain cases, but only benchmarking your own data will show whether it does so without hurting readability.
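
On the benchmarking point, a minimal sketch with a synthetic one-million-row frame (the column names and thresholds here are placeholders, not the project's real point names):

import time

import numpy as np
import pandas as pd

# Synthetic trend log standing in for a real AHU dataset.
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "duct_static": rng.uniform(0.0, 2.0, n),
        "duct_static_setpoint": np.full(n, 1.0),
        "supply_vfd_speed": rng.uniform(0.0, 100.0, n),
    }
)

err_thres, vfd_max, vfd_err = 0.1, 99.0, 5.0

t0 = time.perf_counter()
plain = (df["duct_static"] < (df["duct_static_setpoint"] - err_thres)) & (
    df["supply_vfd_speed"] >= (vfd_max - vfd_err)
)
t1 = time.perf_counter()
evaled = df.eval(
    "(duct_static < (duct_static_setpoint - @err_thres)) "
    "& (supply_vfd_speed >= (@vfd_max - @vfd_err))"
)
t2 = time.perf_counter()

print(f"plain pandas: {t1 - t0:.4f}s  eval: {t2 - t1:.4f}s")
assert plain.equals(evaled)  # both approaches produce the same mask

Note that eval() only uses the numexpr engine when numexpr is installed; otherwise it falls back to the slower Python engine, so the comparison is only meaningful with numexpr present.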