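The fractional split feature of Splitter returns an undesired result when splitting a pandas dataframe with duplicated indices without passing any argument for id_column. The following examples illustrate the bug, using a dataframe with duplicated indices and performing a fractional split on it.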
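The dataframe used in the examples is not shown in this report. A minimal sketch of one with duplicated indices, assuming two 5000-row frames (sizes inferred from the outputs below) that are tagged by a `frame` column and concatenated without resetting the index:

```python
import pandas as pd

# Assumption: two source frames of 5000 rows each, labelled by a `frame` column.
# Both keep the default RangeIndex 0..4999.
frame_a = pd.DataFrame({'frame': ['A'] * 5000})
frame_b = pd.DataFrame({'frame': ['B'] * 5000})

# Concatenating without `ignore_index=True` keeps both sets of labels,
# so every index value appears twice.
dataframe = pd.concat([frame_a, frame_b])

print(dataframe.index.is_unique)           # False
print(dataframe.index.duplicated().sum())  # 5000
```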
from ambrosia.splitter import Splitter
# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5
result_1 = splitter.run(dataframe=dataframe,
                        method='hash',
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()
# Output:
# A 15000
# B 10000
# Name: group, dtype: int64
As a result, some objects are duplicated by the split and appear in the groups several times.
Taken together, the groups contain more rows than the original dataframe.
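One plausible source of the extra rows, not confirmed against the Splitter code, is label-based selection on a non-unique index: `.loc` returns every row matching each requested label, so a list of labels that itself contains duplicates multiplies the rows. A small pandas sketch of that effect on a toy dataframe:

```python
import pandas as pd

# Toy dataframe: labels 0 and 1 each appear twice.
df = pd.DataFrame({'frame': ['A', 'A', 'B', 'B']}, index=[0, 1, 0, 1])

# Labels picked for a group, taken straight from the non-unique index.
labels = list(df.index[df.index == 0])  # two entries, both equal to 0

# Each of the two requested labels matches two rows.
selected = df.loc[labels]
print(len(labels), len(selected))  # 2 4
```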
This behaviour does not occur if we split the dataframe on a column that contains the same duplicated ids.
# Create column from dataframe indices and split on it
dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})
result_2 = splitter.run(dataframe=dataframe,
                        id_column='id_column',
                        method='hash',
                        part_of_table=factor,
                        salt='bug')
result_2.group.value_counts()
# Output:
# A 5000
# B 5000
# Name: group, dtype: int64
However, a closer look reveals another unusual behaviour:
# Count how many objects in group A come from each original dataframe
result_2[result_2.group == 'A'].frame.value_counts()
# Output:
# A 2500
# B 2500
# Name: frame, dtype: int64
Objects from the two original dataframes appear in group A in equal numbers, which is generally not desired.
This should be investigated further.
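The equal counts are consistent with a split keyed only on the id: a deterministic hash of id and salt sends every copy of a duplicated id to the same group. A minimal sketch of that effect, assuming an illustrative `assign_group` helper built on MD5 with a hypothetical `share_a` parameter (this is not Ambrosia's actual hashing scheme):

```python
import hashlib

def assign_group(object_id, salt, share_a=0.5):
    """Hypothetical helper: map an id to group 'A' or 'B' from a salted hash."""
    digest = hashlib.md5(f'{object_id}{salt}'.encode()).hexdigest()
    # Scale the hash to [0, 1) and compare it with the requested share.
    bucket = int(digest, 16) / 16 ** len(digest)
    return 'A' if bucket < share_a else 'B'

# Rows from two source frames that share the same ids.
rows = [(object_id, frame) for frame in ('A', 'B') for object_id in range(1000)]
groups = {(object_id, frame): assign_group(object_id, 'bug')
          for object_id, frame in rows}

# Both copies of every id land in the same group, so each group
# holds equally many rows from frame 'A' and frame 'B'.
same_group = all(groups[(i, 'A')] == groups[(i, 'B')] for i in range(1000))
print(same_group)  # True
```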
The bug was not checked on the Spark implementation of the same methods, but care should be taken there as well.
Finally, duplicated values in the id column are undesirable in the vast majority of splitting tasks.
It would be nice to add a duplicated id check to Splitter and warn the user via the logger.
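A rough sketch of what such a check could look like, assuming a hypothetical `warn_on_duplicated_ids` helper (not part of the current Splitter API) that falls back to the dataframe index when no `id_column` is given:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def warn_on_duplicated_ids(dataframe, id_column=None):
    """Hypothetical check: log a warning if the split ids are not unique.

    Uses the dataframe index when `id_column` is None.
    Returns the number of rows whose id repeats an earlier one.
    """
    ids = dataframe.index if id_column is None else dataframe[id_column]
    n_duplicated = int(ids.duplicated().sum())
    if n_duplicated > 0:
        logger.warning(
            'Found %d rows with duplicated ids; split groups may be distorted.',
            n_duplicated,
        )
    return n_duplicated

df = pd.DataFrame({'frame': ['A', 'A', 'B', 'B']}, index=[0, 1, 0, 1])
print(warn_on_duplicated_ids(df))  # 2
```

Whether the check should only warn or refuse to split is a design choice left open here; a warning keeps existing calls working.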