Error plot distribution of categorical feature

ReinhardSellmair commented 9 months ago

Describe the bug: A ValueError is raised when plotting the distribution of a categorical feature.

To Reproduce: I'm using version 0.10.3 and running the following code:

import nannyml as nml
from IPython.display import display

reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
display(reference_df.head())

column_names = ['car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred']

calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    treat_as_categorical=['y_pred'],
    timestamp_column_name='timestamp',
    continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
    categorical_methods=['chi2', 'jensen_shannon'],
)

calc.fit(reference_df)
results = calc.calculate(analysis_df)

figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
figure.show()

This raises the following error: ValueError: all input arrays must have the same shape

nnansters commented 9 months ago

Hey @ReinhardSellmair, thanks for submitting your report!

This looks like a local issue, since it didn't trip any tests in the automated build and I also couldn't reproduce this on a fresh installation.

Are you working in a fresh environment as well, or does this occur after updating NannyML?

Would you happen to have some more logging, so we can check where the issue arises?

ReinhardSellmair commented 9 months ago

Thanks for your quick response. I installed NannyML version 0.10.3 into an existing environment. Here is the full error log:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-3447662672388356>, line 20
     17 calc.fit(reference_df)
     18 results = calc.calculate(analysis_df)
---> 20 figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
     21 figure.show()

File python/lib/python3.10/site-packages/nannyml/usage_logging.py:238, in log_usage.<locals>.logging_decorator.<locals>.logging_wrapper(*args, **kwargs)
    236 finally:
    237     if runtime_exception is not None:
--> 238         raise runtime_exception
    239     else:
    240         return res

File python/lib/python3.10/site-packages/nannyml/usage_logging.py:187, in log_usage.<locals>.logging_decorator.<locals>.logging_wrapper(*args, **kwargs)
    184 runtime_exception = None
    185 try:
    186     # run original function
--> 187     res = func(*args, **kwargs)
    188 except BaseException as exc:
    189     runtime_exception = exc

File python3.10/site-packages/nannyml/drift/univariate/result.py:249, in Result.plot(self, kind, *args, **kwargs)
    234     return plot_metrics(
    235         self,
    236         title='Univariate drift metrics',
   (...)
    246         metric_name='Method',
    247     )
    248 elif kind == 'distribution':
--> 249     return plot_distributions(
    250         self,
    251         reference_data=self.reference_data,
    252         analysis_data=self.analysis_data,
    253         chunker=self.chunker,
    254     )
    255 else:
    256     raise InvalidArgumentsException(
    257         f"unknown plot kind '{kind}'. " f"Please provide on of: ['drift', 'distribution']."
    258     )

File python/lib/python3.10/site-packages/nannyml/plots/blueprints/distributions.py:83, in plot_distributions(result, reference_data, analysis_data, chunker, title, figure, x_axis_time_title, x_axis_chunk_title, y_axis_title, figure_args, subplot_title_format, number_of_columns)
     80 x_axis_is_time_based = is_time_based_x_axis(analysis_chunk_start_dates, analysis_chunk_end_dates)
     82 if column_name in result.categorical_column_names and method in result.categorical_method_names:
---> 83     figure = _plot_stacked_bar(
     84         figure=figure,
     85         row=row,
     86         col=col,
     87         chunker=chunker,
     88         column_name=column_name,
     89         metric_display_name=method,
     90         reference_data=reference_data[column_name],
     91         reference_data_timestamps=reference_data[result.timestamp_column_name]
     92         if x_axis_is_time_based
     93         else None,
     94         reference_alerts=reference_result.alerts(key),
     95         reference_chunk_keys=reference_result.chunk_keys,
     96         reference_chunk_periods=reference_result.chunk_periods,
     97         reference_chunk_indices=reference_result.chunk_indices,
     98         reference_chunk_start_dates=reference_result.chunk_start_dates,
     99         reference_chunk_end_dates=reference_result.chunk_end_dates,
    100         analysis_data=analysis_data[column_name],
    101         analysis_data_timestamps=analysis_data[result.timestamp_column_name] if x_axis_is_time_based else None,
    102         analysis_alerts=analysis_result.alerts(key),
    103         analysis_chunk_keys=analysis_result.chunk_keys,
    104         analysis_chunk_periods=analysis_result.chunk_periods,
    105         analysis_chunk_indices=analysis_result.chunk_indices,
    106         analysis_chunk_start_dates=analysis_chunk_start_dates,
    107         analysis_chunk_end_dates=analysis_chunk_end_dates,
    108     )
    109 elif column_name in result.continuous_column_names and method in result.continuous_method_names:
    110     figure = _plot_joyplot(
    111         figure=figure,
    112         row=row,
   (...)
    133         analysis_chunk_end_dates=analysis_chunk_end_dates,
    134     )

File python/lib/python3.10/site-packages/nannyml/plots/blueprints/distributions.py:285, in _plot_stacked_bar(figure, column_name, metric_display_name, reference_data, reference_data_timestamps, analysis_data, analysis_data_timestamps, chunker, reference_alerts, reference_chunk_keys, reference_chunk_periods, reference_chunk_indices, reference_chunk_start_dates, reference_chunk_end_dates, analysis_alerts, analysis_chunk_keys, analysis_chunk_periods, analysis_chunk_indices, analysis_chunk_start_dates, analysis_chunk_end_dates, row, col, hover)
    276 if has_reference_results:
    277     reference_value_counts = calculate_value_counts(
    278         data=reference_data,
    279         chunker=chunker,
   (...)
    282         missing_category_label='Missing',
    283     )
--> 285     figure = stacked_bar(
    286         figure=figure,
    287         stacked_bar_table=reference_value_counts,
    288         color=Colors.BLUE_SKY_CRAYOLA,
    289         chunk_indices=reference_chunk_indices,
    290         chunk_start_dates=reference_chunk_start_dates,
    291         chunk_end_dates=reference_chunk_end_dates,
    292         annotation='Reference',
    293         showlegend=True,
    294         legendgrouptitle_text=f'<b>{column_name}</b>',
    295         legendgroup=column_name,
    296         subplot_args=subplot_args,
    297     )
    299     assert reference_chunk_indices is not None
    300     analysis_chunk_indices = analysis_chunk_indices + (max(reference_chunk_indices) + 1)

File python/lib/python3.10/site-packages/nannyml/plots/components/stacked_bar_plot.py:143, in stacked_bar(figure, stacked_bar_table, color, chunk_start_dates, chunk_end_dates, chunk_indices, subplot_args, annotation, **kwargs)
    131     hover.add(data['value_counts_normalised'], name='value_counts_normalised')
    132     hover.add(data['value_counts'], name='value_counts')
    134     figure.add_trace(
    135         Bar(
    136             name=category,
    137             x=x,
    138             y=data['value_counts_normalised'],
    139             orientation='v',
    140             marker=dict(line_color=color, color=category_colors_transparent[i], line_width=0),
    141             yperiodalignment="start",
    142             offset=0,
--> 143             customdata=hover.get_custom_data(),
    144             hovertemplate=hover.get_template(),
    145             hoverlabel=dict(bgcolor=category_colors_transparent[i], font=dict(color='white')),
    146             **kwargs,
    147         ),
    148         **subplot_args,
    149     )
    151 # Shade chunk type
    152 x0 = chunk_start_dates.min() if is_time_based_x_axis(chunk_start_dates, chunk_end_dates) else chunk_indices.min()

File python3.10/site-packages/nannyml/plots/components/hover.py:60, in Hover.get_custom_data(self)
     57 if not isinstance(self.custom_data[0], (List, np.ndarray)):
     58     return np.asarray([self.custom_data, self.custom_data])
---> 60 return np.stack(self.custom_data, axis=-1)

File python3.10/site-packages/numpy/core/shape_base.py:449, in stack(arrays, axis, out, dtype, casting)
    447 shapes = {arr.shape for arr in arrays}
    448 if len(shapes) != 1:
--> 449     raise ValueError('all input arrays must have the same shape')
    451 result_ndim = arrays[0].ndim + 1
    452 axis = normalize_axis_index(axis, result_ndim)

ValueError: all input arrays must have the same shape
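
The last two frames show that np.stack rejects the hover data because the arrays collected by Hover do not all share one shape. A minimal sketch of how one might find out which inputs differ before stacking is below. The helper names `shape_of` and `find_shape_mismatches` are hypothetical and not part of NannyML or NumPy, and the sample lengths are made up for illustration; only the column names are taken from the traceback.

```python
from typing import Dict, Sequence, Tuple


def shape_of(values: Sequence) -> Tuple[int, ...]:
    """Return the NumPy-style shape if present, else a 1-D shape from len()."""
    shape = getattr(values, "shape", None)
    return tuple(shape) if shape is not None else (len(values),)


def find_shape_mismatches(named_arrays: Dict[str, Sequence]) -> Dict[str, Tuple[int, ...]]:
    """Return every input's shape when the inputs do not all share one shape.

    An empty dict means all shapes agree, which is the condition that
    np.stack checks before stacking.
    """
    shapes = {name: shape_of(values) for name, values in named_arrays.items()}
    if len(set(shapes.values())) <= 1:
        return {}
    return shapes


# Hypothetical hover columns with deliberately mismatched lengths.
hover_columns = {
    "value_counts": [10, 20, 30],
    "value_counts_normalised": [0.1, 0.2, 0.3, 0.4],
}
print(find_shape_mismatches(hover_columns))
```

Because the helper falls back to `len()` when no `.shape` attribute exists, it accepts plain lists as well as NumPy arrays or pandas Series.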

michael-nml commented 8 months ago

Hi @ReinhardSellmair, I'm also unable to reproduce this using NannyML 0.10.3 in my environment. There's probably a different dependency version somewhere that is causing this.

Would you be able to share the output of pip freeze in the environment where you're seeing this issue?

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

anopsy commented 1 week ago

I'm experiencing exactly the same problem. This code:

figure = results.filter(column_names=results.categorical_column_names, methods=['jensen_shannon']).plot(kind='distribution')

leads to an error from NannyML's Hover.get_custom_data(), whose call to np.stack raises:

ValueError: all input arrays must have the same shape

My versions: {'nannyml': '0.12.1', 'pandas': '2.2.3', 'polars': '0.20.31', 'pyarrow': '14.0.2', 'numpy': '1.24.4'}, Python 3.10.12

Btw, for debugging purposes (and better communication with our users), maybe we could implement .show_version() and ask for its output in the "report a bug" type of New Issue?
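
A rough sketch of what such a helper might look like is below, using only the standard library. The function `show_version` here is hypothetical and is not an existing NannyML API. The default package list is an assumption based on the packages mentioned in this thread.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable

# Packages mentioned in this thread; an assumed default, adjust as needed.
DEFAULT_PACKAGES = ("nannyml", "pandas", "polars", "pyarrow", "numpy")


def show_version(packages: Iterable[str] = DEFAULT_PACKAGES) -> Dict[str, str]:
    """Collect the Python version and the installed version of each package."""
    versions: Dict[str, str] = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing, so the output
            # can always be pasted into a bug report.
            versions[name] = "not installed"
    return versions


print(show_version())
```

Reading versions through `importlib.metadata` avoids importing the packages themselves, so the helper still works when one of them fails to import.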

EDIT: Fixed it by using the right environment: {'nannyml': '0.12.1', 'pandas': '1.5.3', 'pyarrow': '14.0.2', 'numpy': '1.24.4'}. pandas >2 could be the culprit.

NannyML / nannyml

Error plot distribution of categorical feature #370