carissalow / rapids

Reproducible Analysis Pipeline for Data Streams
http://www.rapids.science/
GNU Affero General Public License v3.0

ValueError: cannot handle a non-unique multi-index! in heatmap_sensors_per_minute_per_time_segment #169

Closed junoslukan closed 2 years ago

junoslukan commented 2 years ago

The rule heatmap_sensors_per_minute_per_time_segment throws the following error:

Traceback (most recent call last):
  File "/rapids/.snakemake/scripts/tmpr5wbxgdu.heatmap_sensors_per_minute_per_time_segment.py", line 99, in <module>
    data_for_plot_per_segment = getDataForPlot(phone_data_yield_per_segment)
  File "/rapids/.snakemake/scripts/tmpr5wbxgdu.heatmap_sensors_per_minute_per_time_segment.py", line 45, in getDataForPlot
    phone_data_yield_per_segment = phone_data_yield_per_segment.set_index(["local_segment_start_datetimes", "minutes_after_segment_start"]).reindex(full_index).reset_index().fillna(0)
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/util/_decorators.py", line 309, in wrapper
    return func(*args, **kwargs)
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/frame.py", line 4036, in reindex
    return super().reindex(**kwargs)
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/generic.py", line 4464, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/frame.py", line 3883, in _reindex_axes
    index, method, copy, level, fill_value, limit, tolerance
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/frame.py", line 3899, in _reindex_index
    new_index, method=method, level=level, limit=limit, tolerance=tolerance
  File "miniconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2319, in reindex
    raise ValueError("cannot handle a non-unique multi-index!")
ValueError: cannot handle a non-unique multi-index!

I do not have a minimal reproducible example at hand, but I have debugged the problem enough to explain what I think is going on; a mock-up example is provided below.

As seen from the traceback, the dataframe phone_data_yield_per_segment should be reindexed with full_index, but the existing index created by .set_index(["local_segment_start_datetimes", "minutes_after_segment_start"]) is not unique.
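
To illustrate the pandas behaviour itself (this is just a toy frame, not data from the pipeline), reindexing any DataFrame whose current MultiIndex contains duplicate entries raises exactly this error, at least with the pandas version shown in the traceback:

import pandas as pd

# a two-row frame whose (a, b) index contains the same entry twice
df = pd.DataFrame({"a": [1, 1], "b": [2, 2], "value": [10, 20]}).set_index(["a", "b"])
# reindexing onto any different target index then fails
full_index = pd.MultiIndex.from_product([[1], [2, 3]], names=["a", "b"])
df.reindex(full_index)  # ValueError: cannot handle a non-unique multi-index!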

This index is created earlier by first selecting the maximum local_date_time (and sensor) within each segment and local minute, in this line:

# calculate the number of sensors logged at least one row of data per minute.
phone_data_yield_per_segment = phone_data_yield_per_segment.groupby(["local_segment", "length", "local_date", "local_hour", "local_minute"])[["sensor", "local_date_time"]].max().reset_index()    

Next, minutes_after_segment_start is calculated as the Timedelta in minutes between local_date_time and local_segment_start_datetimes:

# calculate the number of minutes after local start datetime of the segment
phone_data_yield_per_segment["minutes_after_segment_start"] = ((phone_data_yield_per_segment["local_date_time"] - phone_data_yield_per_segment["local_segment_start_datetimes"]) / pd.Timedelta(minutes=1)).astype("int")

It is this cast to int (which effectively acts as a floor) that later creates duplicates in the index.

Consider the following example:

import pandas as pd

phone_data_yield_per_segment = pd.DataFrame(
    data={
        "local_segment_start_datetimes": [
            pd.Timestamp("2021-12-14T23:59:59"),
            pd.Timestamp("2021-12-14T23:59:59"),
        ],
        "local_date_time": [
            pd.Timestamp("2021-12-15T15:39:59"),
            pd.Timestamp("2021-12-15T15:40:45"),
        ],
    }
)
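
# apply the same minutes_after_segment_start calculation as in the pipeline code above
phone_data_yield_per_segment["minutes_after_segment_start"] = ((phone_data_yield_per_segment["local_date_time"] - phone_data_yield_per_segment["local_segment_start_datetimes"]) / pd.Timedelta(minutes=1)).astype("int")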

After calculating minutes_after_segment_start as above, we get:

print(phone_data_yield_per_segment["minutes_after_segment_start"])
# 0    940
# 1    940
# Name: minutes_after_segment_start, dtype: int32

This duplication is what makes the multi-index non-unique.

A workaround would be to first drop duplicates on the indexing columns in getDataForPlot(), but I am not sure whether this is actually the desired outcome:

phone_data_yield_per_segment = phone_data_yield_per_segment.drop_duplicates(subset=["local_segment_start_datetimes", "minutes_after_segment_start"], keep="first")
phone_data_yield_per_segment.set_index(["local_segment_start_datetimes", "minutes_after_segment_start"], verify_integrity=True).reindex(full_index)
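
As a sanity check, applying this to the mock-up above makes verify_integrity pass and the reindex succeed. Note that full_index here is only a stand-in I constructed for this example (the real one comes from getDataForPlot()):

deduped = phone_data_yield_per_segment.drop_duplicates(subset=["local_segment_start_datetimes", "minutes_after_segment_start"], keep="first")
# stand-in for the script's full_index: one row per minute after the segment start
full_index = pd.MultiIndex.from_product([deduped["local_segment_start_datetimes"].unique(), range(941)], names=["local_segment_start_datetimes", "minutes_after_segment_start"])
reindexed = deduped.set_index(["local_segment_start_datetimes", "minutes_after_segment_start"], verify_integrity=True).reindex(full_index)
print(reindexed.shape)
# (941, 1)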
JulioV commented 2 years ago

Thanks for reporting! @Meng6 could you take a look please?

Meng6 commented 2 years ago

Hi @junoslukan, thanks for reporting! I updated the code in https://github.com/carissalow/rapids/commit/8a24ad5be51ddcd43b1cbdac5815f2a8963a0840. You can pull the latest code from the plots/fixbug#169 branch. Please let me know if you still get errors.

JulioV commented 2 years ago

Let's merge