acturtle / cashflower

An open-source Python framework for actuarial cash flow models
https://cashflower.acturtle.com
MIT License

Avoid transpose #370

Closed. zchmielewska closed this issue 1 month ago

zchmielewska commented 4 months ago

Avoid transposing in:

    def prepare_output_with_grouping(self, group_sums, output_columns, one_core):
        group_by_column = self.settings["GROUP_BY_COLUMN"]

        log_message("Preparing output...", show_time=True, print_and_save=one_core)

        lst_dfs = []
        for group, data in group_sums.items():
            group_df = pd.DataFrame(data=np.transpose(data), columns=output_columns)
            group_df.insert(0, group_by_column, group)
            lst_dfs.append(group_df)

        output = pd.concat(lst_dfs, ignore_index=True)
        return output

Transposing data can be slow for large datasets. Instead of transposing, it might be better to adjust the code that creates the group_sums dictionary so that each array already has the required shape (one row per period, one column per output variable).
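To picture the suggestion, here is a minimal sketch, not the project's actual implementation. It assumes each value in group_sums already has shape (periods, output columns), so the DataFrame can be built directly. The function name `prepare_output_no_transpose`, the column names, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd


def prepare_output_no_transpose(group_sums, output_columns, group_by_column):
    """Hypothetical variant: assumes each array is already (periods, columns)."""
    lst_dfs = []
    for group, data in group_sums.items():
        # No np.transpose needed: rows are periods, columns are variables.
        group_df = pd.DataFrame(data=data, columns=output_columns)
        group_df.insert(0, group_by_column, group)
        lst_dfs.append(group_df)
    return pd.concat(lst_dfs, ignore_index=True)


# Made-up example: 2 groups, 3 periods, 2 output columns.
group_sums = {
    "A": np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    "B": np.array([[4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]),
}
output = prepare_output_no_transpose(group_sums, ["premium", "claim"], "product")
```

This only works if the code that fills group_sums accumulates arrays in that orientation. Whether it is measurably faster would need profiling on a realistic model.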

zchmielewska commented 1 month ago

There are 3 places where results are transposed. The other 2 are:

def ind_prepare_output(self, results, output_columns, one_core):
    ...
    total_data = [pd.DataFrame(np.transpose(arr)) for arr in results]

def agg_prepare_output_without_grouping(self, results, output_columns, one_core):
    ...
    results = np.transpose(results)
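All three call sites appear to transpose for the same reason: results arrive with one row per variable, while the output needs one row per period. A minimal sketch with made-up result vectors shows that stacking the vectors as columns with `np.column_stack` yields the same array that the transpose would. The names `results_by_variable` and `t_max_output` are hypothetical, not cashflower's.

```python
import numpy as np

# Made-up per-variable result vectors (one value per period t = 0..4).
results_by_variable = [
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
]
t_max_output = 2  # hypothetical stand-in for the T_MAX_OUTPUT setting

# Current orientation: one row per variable, shape (n_variables, t_max_output + 1).
by_rows = np.array([r[: t_max_output + 1] for r in results_by_variable])

# Alternative: one column per variable, shape (t_max_output + 1, n_variables).
by_columns = np.column_stack([r[: t_max_output + 1] for r in results_by_variable])

# Same values as the transposed array, but built directly in output orientation.
same = np.array_equal(by_columns, by_rows.T)
```

If the per-model-point results were built this way, the downstream summation and output code would also have to expect periods along the first axis.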

So, we need to change:

def calculate_model_point(self, row, one_core, progressbar_max):
    # Get results and trim for T_MAX_OUTPUT; results may contain a subset of columns
    if len(self.settings["OUTPUT_COLUMNS"]) > 0:
        mp_results = np.array([v.result[:self.settings["T_MAX_OUTPUT"]+1] for v in self.variables if v.name in self.settings["OUTPUT_COLUMNS"]])
    else:
        mp_results = np.array([v.result[:self.settings["T_MAX_OUTPUT"]+1] for v in self.variables])