ONSdigital / sml-python-small

Statistical Methods Library for Python Pandas methods used in SPP.
MIT License

Modify Pandas Formatting in Example Code #66

Closed gibbardsteve closed 12 months ago

gibbardsteve commented 12 months ago

Synopsis

The demo pandas examples for totals and components and for thousand pounds correction do not format the output CSV to match the provided UAT CSV files. MQD have asked that, by default, the pandas example expand any cell that contains a list into individual columns.

E.g. Target Variable target_variable [TargetVariable(identifier='q42', original_value='32', final_value='0.032'),...]

should be written as separate columns, one per identifier, with the final_value as the cell value.

E.g. q42 0.032

Similarly, the totals and components example should expand the Final Components column into separate columns headed col_1, col_2, ...
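The expansion requested for totals and components could look roughly like the sketch below. It is an illustration only, not the repository's implementation: the helper name `expand_to_numbered_columns`, the column names `identifier` and `final_components`, and the sample values are all assumptions made for this example.

```python
import pandas as pd


def expand_to_numbered_columns(
    df: pd.DataFrame, column: str, prefix: str = "col"
) -> pd.DataFrame:
    """Replace a list-valued column with prefix_1, prefix_2, ... columns.

    Hypothetical helper for illustration; not the function used in example.py.
    """
    # One output row per input row, one output column per list position.
    expanded = pd.DataFrame(df[column].tolist(), index=df.index)
    expanded.columns = [f"{prefix}_{i + 1}" for i in range(expanded.shape[1])]
    return pd.concat([df.drop(columns=[column]), expanded], axis=1)


# Made-up sample data; the column names are assumptions.
results = pd.DataFrame(
    {
        "identifier": ["A", "B"],
        "final_components": [[10.0, 20.0], [5.0, 15.0]],
    }
)
wide = expand_to_numbered_columns(results, "final_components")
print(wide.to_csv(index=False))
```

Rows whose lists are shorter than the longest list would be padded with missing values, which pandas writes to CSV as empty cells.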

Checklist

Description

README.md/thousand_pounds.py/test_thousand_pounds.py

PEP 8 states that class names should use the CapWords (camel case) convention. The totals and components code follows this convention but thousand pounds did not, so the classes in thousand_pounds have been renamed to follow it.

totals_and_components.py

Type hinting corrected to show that the function returns a list of final_components.
Minor docstring updates.

example.py

filter_columns_by_pattern() handles input columns (e.g. component_1, component_2, ...) whose number could be dynamic.
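As a rough illustration of selecting a dynamic set of input columns by pattern, the sketch below uses a regular expression over the column names. The signature of the real filter_columns_by_pattern() is not shown in this description, so the name `select_columns_matching`, the pattern, and the sample columns are assumptions.

```python
import re

import pandas as pd


def select_columns_matching(df: pd.DataFrame, pattern: str) -> list:
    """Return the column names that fully match the regular expression.

    Hypothetical stand-in; the real helper in example.py may differ.
    """
    regex = re.compile(pattern)
    # Preserve the order in which the columns appear in the DataFrame.
    return [name for name in df.columns if regex.fullmatch(str(name))]


# Made-up input frame with a variable number of component columns.
inputs = pd.DataFrame(
    {
        "identifier": ["A"],
        "total": [30.0],
        "component_1": [10.0],
        "component_2": [12.0],
        "component_3": [8.0],
    }
)
component_columns = select_columns_matching(inputs, r"component_\d+")
# Collect each row's component values into a single list.
components = inputs[component_columns].values.tolist()
print(component_columns, components)
```

Matching on the whole name with `fullmatch` avoids accidentally picking up columns such as a hypothetical `component_1_note`.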

expand_list_column() handles output columns (e.g. TargetVariables) that could be dynamic, separating each list into its own data structure in which every element of the list is mapped to a separate column.
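For the thousand pounds case described in the synopsis, one way to map each list element to its own column is sketched below. This is not the actual expand_list_column() code: the `TargetVariable` dataclass is a stand-in whose fields are taken from the example in the synopsis, and the function name, the `reference` column and the second variable `q43` are assumptions added for illustration.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class TargetVariable:
    # Stand-in for the library class; fields follow the synopsis example.
    identifier: str
    original_value: str
    final_value: str


def expand_target_variables(
    df: pd.DataFrame, column: str = "target_variable"
) -> pd.DataFrame:
    """Replace a column of TargetVariable lists with one column per identifier.

    Hypothetical helper; each new cell holds that variable's final_value.
    """
    expanded = pd.DataFrame(
        [{tv.identifier: tv.final_value for tv in cell} for cell in df[column]],
        index=df.index,
    )
    return pd.concat([df.drop(columns=[column]), expanded], axis=1)


# Illustrative data: q42 comes from the synopsis, q43 is made up.
output = pd.DataFrame(
    {
        "reference": ["A"],
        "target_variable": [
            [
                TargetVariable("q42", "32", "0.032"),
                TargetVariable("q43", "5000", "5"),
            ]
        ],
    }
)
flat = expand_target_variables(output)
print(flat.to_csv(index=False))
```

Keying the new columns on the identifier, rather than on list position, keeps the output stable if the variables arrive in a different order from row to row.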

Updated the code so that the thousand pounds and totals and components examples are now functions that take a path, an input_csv filename and an output_csv filename.

Columns that were previously declared with spaces in their names (e.g. Absolute Difference) have been renamed to snake case, using the naming from the example UAT CSVs.
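A minimal sketch of this kind of rename is shown below. The actual headings are taken from the UAT CSVs, which are not reproduced here, so the mechanical conversion, the helper name `to_snake_case` and the sample columns are assumptions that may not match the real headings exactly.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a heading and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip()).lower()


# Made-up frame using space-separated headings.
before = pd.DataFrame({"Absolute Difference": [1.5], "Final Total": [30.0]})
# rename() accepts a callable that is applied to every column name.
after = before.rename(columns=to_snake_case)
print(list(after.columns))
```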

The output returned from the main method (e.g. totals_and_components) now has some minor post-processing applied to ensure the column headings and data align with the UAT CSV files before being written to the CSV output files.

pandas_wrapper.py

Columns that were previously declared with spaces in their names (e.g. Absolute Difference) have been renamed to snake case, using the naming from the example UAT CSVs.

example_test_data.csv

Updated so that it only includes the input columns, as this file is the input data for the totals_and_components method.

example_test_data_pandas_output.csv

Kept in git (for now) to show the example changes implemented by this PR.