ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Chapter 3: MNIST "sort_by_target" error #608

Closed Iovus closed 3 years ago

Iovus commented 3 years ago

Hi, I'm unable to run "sort_by_target" on the MNIST dataset, and so far I haven't been able to solve the problem.

Help is appreciated :)

Python version is 3.8.5. NumPy version is 1.19.4. Scikit-learn version is 0.24.0.

Using the "sort_by_target" code from the chapter 3 repository, I got the following error message:


KeyError                                  Traceback (most recent call last)

in
----> 1 sort_by_target(mnist) # fetch_openml() returns an unsorted dataset

in sort_by_target(mnist)
      2     reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
      3     reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
----> 4     mnist.data[:60000] = mnist.data[reorder_train]
      5     mnist.target[:60000] = mnist.target[reorder_train]
      6     mnist.data[60000:] = mnist.data[reorder_test + 60000]

~/.local/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3028             if is_iterator(key):
   3029                 key = list(key)
-> 3030             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   3031
   3032         # take() does not accept boolean indexers

~/.local/lib/python3.8/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1263             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1264
-> 1265         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1266         return keyarr, indexer
   1267

~/.local/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1305             if missing == len(indexer):
   1306                 axis_name = self.obj._get_axis_name(axis)
-> 1307                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1308
   1309             ax = self.obj._get_axis(axis)

KeyError: "None of [Int64Index([ 1, 21, 34, 37, 51, 56, 63, 68, 69,\n 75,\n ...\n 59910, 59917, 59927, 59939, 59942, 59948, 59969, 59973, 59990,\n 59992],\n dtype='int64', length=60000)] are in the [columns]"

tigerH666 commented 3 years ago

Hi Iovus, I got the same issue. I found out that "mnist.data" is no longer a NumPy array as in the book; it is now a DataFrame object. I just added one line to convert the DataFrame to a NumPy array, and it works just like in the book.

This is the sort_by_target function that works for me:

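import numpy as np

def sort_by_target(mnist):
    # Indices that sort the first 60,000 samples (the training set) by label
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    # Same for the remaining samples (the test set), relative to offset 60000
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    # Added line: convert the DataFrame to a NumPy array so integer arrays index rows
    mnist.data = mnist.data.to_numpy()
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]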

2021-1-23, Python 3.7.6, NumPy 1.19.5, Scikit-learn 0.24.1
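
For anyone wondering why the KeyError mentions "[columns]": indexing a DataFrame with square brackets and an array of integers is treated as a column lookup, whereas the same expression on a NumPy array selects rows. Here is a minimal sketch of the difference. The tiny `frame` and its values are made up for illustration and only stand in for `mnist.data`; `order` plays the role of `reorder_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for mnist.data: 3 samples, 2 "pixel" columns.
frame = pd.DataFrame({"pixel1": [10, 20, 30], "pixel2": [11, 21, 31]})
order = np.array([2, 0, 1])  # a row permutation, like reorder_train

# On a DataFrame, frame[order] looks for columns labelled 2, 0 and 1.
# No such columns exist here, so pandas raises KeyError.
try:
    frame[order]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# On a NumPy array, the same expression selects rows by position.
rows_from_array = frame.to_numpy()[order]

# .iloc selects rows by position without leaving pandas.
rows_from_iloc = frame.iloc[order].to_numpy()

print(raised_key_error)
print(rows_from_array)
```

Either `to_numpy()` or `.iloc` avoids the column lookup. Converting once with `to_numpy()` keeps the rest of the book's code unchanged, which is presumably why the one-line fix above works.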

yukikatase commented 3 years ago

Thank you!

ageron commented 3 years ago

Thanks for your feedback everyone!

So here's the deal: in Scikit-Learn 0.24, the fetch_openml() function started returning Pandas DataFrames instead of NumPy arrays by default. This was causing various errors in the notebooks that used fetch_openml(). This is why you ran into an issue; I'm sorry about that. I updated all the notebooks to set as_frame=False when calling fetch_openml(), and now everything's back to normal. You can get the latest code using git pull. Make sure to install the latest versions of the libraries as well (using environment.yml or requirements.txt).
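
For readers who cannot pull the updated notebooks right away, the following is a rough sketch of the idea, not the exact notebook code. The helper names `load_mnist_arrays` and `sort_by_target_arrays` and the `n_train` parameter are made up for this example. The sort uses a stable `np.argsort` instead of the `sorted()` list of tuples, which should yield the same ordering, and it returns sorted copies rather than modifying the dataset in place.

```python
import numpy as np


def load_mnist_arrays():
    """Fetch MNIST as NumPy arrays (downloads the dataset on first call)."""
    # Imported here so the sorting helper below works without scikit-learn.
    from sklearn.datasets import fetch_openml

    # as_frame=False asks for NumPy arrays instead of Pandas objects.
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    # fetch_openml() returns the targets as strings; cast them to integers.
    return mnist.data, mnist.target.astype(np.int8)


def sort_by_target_arrays(data, target, n_train=60000):
    """Return copies in which the train part and the test part are each
    sorted by label; ties keep their original order (stable sort)."""
    train_order = np.argsort(target[:n_train], kind="stable")
    test_order = np.argsort(target[n_train:], kind="stable") + n_train
    order = np.concatenate([train_order, test_order])
    return data[order], target[order]


# Usage (requires a network connection the first time):
# X, y = load_mnist_arrays()
# X, y = sort_by_target_arrays(X, y)
```

Because the helper works on plain arrays, it can be checked on a small synthetic dataset before running it on the full 70,000 samples.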

Closing this issue. Please feel free to reopen it if the problem persists.

Hope this helps!