DistrictDataLabs / yellowbrick

Visual analysis and diagnostic tools to facilitate machine learning model selection.
http://www.scikit-yb.org/
Apache License 2.0

RadViz error with a DataFrame that doesn't have a sequential index #1285

Open KimByoungmo opened 2 years ago

KimByoungmo commented 2 years ago

Describe the bug: I drew a RadViz for the DataFrame below (see Dataset), and the error below occurred (see the Python code). I think this is caused by the following code in radviz.py: [image: radviz.py]. y[i] should be changed to y.iloc[i].

To Reproduce

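import matplotlib.pyplot as plt
from yellowbrick.features import RadViz

# train_X / train_y: a DataFrame and Series whose index is not sequential
fig, ax = plt.subplots()
rad = RadViz(ax=ax,
             classes=["not diabete", "diabete"])
rad.fit(train_X, train_y)
rad.show()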

Error message:

KeyError                                  Traceback (most recent call last)
File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\core\indexes\base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:2263, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:2273, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [8], line 7
      1 fig,ax = plt.subplots()
      2 rad = RadViz(ax =ax,
      3              classes=["not diabete","diabete"],
      4
      5              )
----> 7 rad.fit(train_X,train_y)
      8 rad.show()

File c:\Users\ksd20\Python\Python39\lib\site-packages\yellowbrick\features\radviz.py:159, in RadialVisualizer.fit(self, X, y, **kwargs)
    137 """
    138 The fit method is the primary drawing input for the
    139 visualization since it has both the X and y data required for the
   (...)
    156 Returns the instance of the transformer/visualizer
    157 """
    158 super(RadialVisualizer, self).fit(X, y)
--> 159 self.draw(X, y, **kwargs)
    160 return self

File c:\Users\ksd20\Python\Python39\lib\site-packages\yellowbrick\features\radviz.py:200, in RadialVisualizer.draw(self, X, y, **kwargs)
    198 row_ = np.repeat(np.expand_dims(row, axis=1), 2, axis=1)
    199 xy = (s * row_).sum(axis=0) / row.sum()
--> 200 label = self._label_encoder[y[i]]
    202 to_plot[label][0].append(xy[0])
    203 to_plot[label][1].append(xy[1])

File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\core\series.py:982, in Series.__getitem__(self, key)
    979     return self._values[key]
    981 elif key_is_scalar:
--> 982     return self._get_value(key)
    984 if is_hashable(key):
    985     # Otherwise index.get_value will raise InvalidIndexError
    986     try:
    987         # For labels that don't resolve as scalars like tuples and frozensets

File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\core\series.py:1092, in Series._get_value(self, label, takeable)
   1089     return self._values[label]
   1091 # Similar to Index.get_value, but we do not fall back to positional
-> 1092 loc = self.index.get_loc(label)
   1093 return self.index._get_values_for_loc(self, loc, label)

File c:\Users\ksd20\Python\Python39\lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
...
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 0

Dataset: [image: the dataset used] The index does not start at 0 and is not sequential.


rebeccabilbro commented 2 years ago

Hello @KimByoungmo and thank you for using Yellowbrick!

This is an issue that has been noted in the past (#1180), and we can definitely appreciate your use case. However, Yellowbrick is designed to work with or without pandas, treating X and y as numpy arrays. Because of that, we cannot simply change y[i] to y.iloc[i], as that would break Radviz's functionality for all users who are not passing in a dataframe.
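To make the trade-off concrete, here is a minimal sketch of the indexing behavior being discussed. It does not use Yellowbrick at all; the variable names and the sample values are made up for illustration. It shows that `y[i]` on a Series with a non-sequential integer index is a label lookup, that `.iloc` is positional, and that a plain NumPy array has no `.iloc` attribute.

```python
import numpy as np
import pandas as pd

# A Series whose integer index neither starts at 0 nor is sequential,
# similar to what a train/test split of a DataFrame can produce.
# (Values and index labels are illustrative only.)
y_series = pd.Series([0, 1, 1], index=[5, 9, 2])


def label_lookup_fails(y, i):
    """Return True if y[i] raises KeyError, False otherwise."""
    try:
        y[i]
    except KeyError:
        return True
    return False


# y_series[0] looks for the *label* 0, which does not exist in the index.
label_fails = label_lookup_fails(y_series, 0)

# .iloc is positional, so it returns the first row regardless of labels.
first_by_position = y_series.iloc[0]

# A NumPy array indexes positionally but has no .iloc attribute,
# which is why swapping y[i] for y.iloc[i] would break array inputs.
y_array = y_series.to_numpy()
first_from_array = y_array[0]
array_has_iloc = hasattr(y_array, "iloc")
```

Under these assumptions, one possible workaround on the caller's side is to pass `train_y.to_numpy()` (or a Series after `reset_index(drop=True)`) so that positional and label lookups agree.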

That said, if you have any ideas about how to change the functionality so that Yellowbrick works more easily with both pandas and non-pandas data, we're all ears! Would you be interested in opening up a PR with some exploratory code to handle non-continuously indexed dataframes that passes both our pandas and our numpy integration tests?

KimByoungmo commented 2 years ago

@rebeccabilbro

I think it would be better to use the isinstance function. For example:

# pandas objects: index by position
if isinstance(my_object, (pd.Series, pd.DataFrame)):
    my_object.iloc[i]
# everything else (e.g. NumPy arrays, lists): plain indexing is positional
else:
    my_object[i]
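The suggestion above could be wrapped in a small helper along these lines. This is only a sketch: `item_at` is a hypothetical name, not part of Yellowbrick, and the sample data is invented to exercise the three input types.

```python
import numpy as np
import pandas as pd


def item_at(obj, i):
    """Return the i-th item of obj by position (hypothetical helper)."""
    # pandas objects use label-based lookup for obj[i] when the index
    # holds integers, so go through .iloc for positional access.
    if isinstance(obj, (pd.Series, pd.DataFrame)):
        return obj.iloc[i]
    # NumPy arrays and lists already index by position.
    return obj[i]


# The same three labels as a non-sequentially indexed Series,
# a NumPy array, and a plain list.
y_series = pd.Series(["a", "b", "c"], index=[10, 20, 30])
y_array = np.array(["a", "b", "c"])
y_list = ["a", "b", "c"]

results = [item_at(y, 1) for y in (y_series, y_array, y_list)]
```

With this shape, the pandas check lives in one place and non-pandas inputs take the unchanged `obj[i]` path.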

rebeccabilbro commented 2 years ago

@KimByoungmo looks like you're on the right track (though I think you'll also want an isinstance(my_object, numpy.ndarray) in there too, right?)

We would welcome looking at a PR from you if you are open to contributing to Yellowbrick. We are a small team of unpaid volunteers, so we depend a lot on the contributions of our users to expand/change current functionality. You can check out our contributor's guide to see how to get started!
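As one possible starting point for such a PR, an alternative to per-type isinstance checks is to normalize the target to a NumPy array once, before the drawing loop. The sketch below assumes that approach; `as_positional` is a hypothetical helper name and this is not how Yellowbrick currently handles its inputs.

```python
import numpy as np
import pandas as pd


def as_positional(y):
    """Hypothetical helper: return y as a NumPy array so y[i] is positional."""
    # np.asarray converts a Series (dropping its index) or a list,
    # and hands back an existing ndarray unchanged.
    return np.asarray(y)


# A target Series with a shuffled, non-sequential index (illustrative data).
y_series = pd.Series([1, 0, 1], index=[7, 3, 11])
y_pos = as_positional(y_series)

# A loop like the one in the traceback can now use y_pos[i] safely.
labels = [int(y_pos[i]) for i in range(len(y_pos))]

# An input that is already an ndarray is returned as the same object.
y_array = np.array([1, 0, 1])
same_object = as_positional(y_array) is y_array
```

The design choice here is to pay one conversion up front so the rest of the code can treat `y` uniformly; whether that fits Yellowbrick's pandas and NumPy integration tests would need to be checked in the PR itself.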