Fractional split bug on duplicated dataframes indices

The fractional split feature of Splitter returns an incorrect result when it is used to split a pandas dataframe with duplicated indices and no id_column argument is passed.

The following examples illustrate the bug.

Let's create a dataframe with duplicated indices:

import numpy as np
import pandas as pd

# Create separate dfs
df_1 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_1['frame'] = 1

df_2 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_2['frame'] = 2

# Concat and shuffle
dataframe = pd.concat([df_1, df_2]).sample(frac=1)

Now perform a fractional split on it:

from ambrosia.splitter import Splitter

# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5

result_1 = splitter.run(dataframe=dataframe, 
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()

# Output:
# A    15000
# B    10000
# Name: group, dtype: int64

As a result, some objects are duplicated by the split and appear in the groups several times: the groups together contain 25,000 rows, more than the 10,000 rows of the original dataframe.

This behaviour does not occur if we split the dataframe on a column that contains the same duplicated ids.
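One possible mechanism, stated here as an assumption and not as a confirmed description of Ambrosia's internals, is label-based selection on a non-unique index: when pandas looks rows up by index label, it returns every row that carries that label. A minimal pandas-only sketch (the frame names and the "pick four rows" step are illustrative):

```python
import pandas as pd

# Two small frames that both carry index labels 0..2, as in the example above.
df_a = pd.DataFrame({"frame": [1, 1, 1]})
df_b = pd.DataFrame({"frame": [2, 2, 2]})
df = pd.concat([df_a, df_b])  # index labels are now 0, 1, 2, 0, 1, 2

# Hypothetical split step: pick four rows, remember only their index labels,
# then fetch the rows back by label.
picked_labels = df.index[:4]        # labels 0, 1, 2, 0
selected = df.loc[picked_labels]    # returns every row matching each label

n_picked = len(picked_labels)
n_selected = len(selected)
print(n_picked, n_selected)
```

Four labels were picked, yet eight rows come back, because each label matches two rows. Any split that stores labels and later resolves them with a label-based lookup would inflate group sizes in this way.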

# Create column from dataframe indices and split on it

dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})

result_2 = splitter.run(dataframe=dataframe, 
                        id_column='id_column',
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')

result_2.group.value_counts()

# Output:
# A    5000
# B    5000
# Name: group, dtype: int64

But on closer inspection, there is another unexpected behaviour:

# Count how many objects in group A come from each original dataframe

result_2[result_2.group == 'A'].frame.value_counts()

# Output:
# 1    2500
# 2    2500
# Name: frame, dtype: int64

Objects from the two original dataframes appear in group A in equal numbers, which in general is not desired. This should be inspected further.

The bug was not checked against the Spark implementation of the same methods, but those should be examined as well.

Finally, duplicated values in the id column are undesirable in the vast majority of splitting tasks. It would be nice to add a duplicated-id check to Splitter and warn the user via the logger.
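The equal counts would follow if the hash method assigns a group from a salted hash of the id alone, so that both rows sharing an id always land in the same group. The sketch below illustrates that assumption with a toy hash; `toy_hash_group` is a hypothetical helper and is not Ambrosia's actual hashing scheme.

```python
import hashlib

import pandas as pd


def toy_hash_group(row_id: int, salt: str) -> str:
    """Toy stand-in for a salted hash split: the group depends only on id and salt."""
    digest = hashlib.sha256(f"{row_id}{salt}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Every id occurs twice, once per original frame, as in the report.
toy = pd.DataFrame({
    "id_column": list(range(1000)) * 2,
    "frame": [1] * 1000 + [2] * 1000,
})
toy["group"] = toy["id_column"].map(lambda i: toy_hash_group(i, "bug"))

# Each id maps to exactly one group, so group A holds as many rows from frame 1 as from frame 2.
groups_per_id = toy.groupby("id_column")["group"].nunique()
counts_in_a = toy[toy.group == "A"].frame.value_counts()
print(counts_in_a)
```

Under this assumption the balanced frame counts are a direct consequence of duplicated ids, not an independent defect, which is one more reason to flag duplicates before splitting.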
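A minimal sketch of such a check, assuming it would run before the split. The function name `count_duplicated_ids` and the logger name are hypothetical and do not exist in Ambrosia.

```python
import logging
from typing import Optional

import pandas as pd

logger = logging.getLogger("splitter_sketch")  # hypothetical logger name


def count_duplicated_ids(dataframe: pd.DataFrame, id_column: Optional[str] = None) -> int:
    """Count rows whose id repeats an earlier one and log a warning if there are any.

    Falls back to the dataframe index when no id column is given.
    """
    ids = dataframe.index if id_column is None else dataframe[id_column]
    n_duplicated = int(ids.duplicated().sum())
    if n_duplicated > 0:
        source = "index" if id_column is None else f"column '{id_column}'"
        logger.warning("Found %d duplicated ids in %s; split results may be unreliable.",
                       n_duplicated, source)
    return n_duplicated


# Usage: two frames with overlapping default indices, as in the report.
example = pd.concat([pd.DataFrame({"x": [1, 2, 3]}), pd.DataFrame({"x": [4, 5, 6]})])
dups_in_index = count_duplicated_ids(example)
dups_after_reset = count_duplicated_ids(example.reset_index(drop=True))
print(dups_in_index, dups_after_reset)
```

Returning the count alongside the warning lets the caller decide whether to continue, deduplicate, or raise.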
