fastai / fastbook

The fastai book, published as Jupyter Notebooks

09_tabular: ProductSize histogram's y-axis is mislabeled #590

Open rigdern opened 1 year ago

rigdern commented 1 year ago

Problem

The book's histogram of ProductSizes in the "Partial Dependence" section has a mislabeled y-axis. Consequently, the histogram communicates the wrong counts for some of the ProductSizes. Here are some ProductSizes it mislabeled:

ProductSize    Correct Count    Book's Incorrect Count
Large          280              ~500
Mini           627              ~100

See below for details.

Book's incorrect histogram

The "Partial Dependence" section has a ProductSize histogram that is produced by this code:

p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);

and renders like this:

image

Corrected histogram

We can reveal the mistake in the book's histogram by inspecting a textual histogram from the dataframe:

cond = (df.saleYear<2011) | (df.saleMonth<10)
df_valid = df[~cond]
df_valid.ProductSize.value_counts(dropna=False)

That code produces this textual histogram:

NaN               3930
Medium            1331
Large / Medium    1223
Mini               627
Small              484
Large              280
Compact            113
Name: ProductSize, dtype: int64

See the table at the top of this issue for a comparison between the counts of these ProductSizes and the ones from the book's histogram.

Cause

The code that labels the y-axis assumes the bars are in ProductSize order: that the bottom bar is ProductSize 0, the next bar is ProductSize 1, and so on. This isn't the case. value_counts(sort=False) does not order its result by ProductSize, so the labels are attached to the wrong bars.
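
To make the mismatch concrete, here is a minimal pandas-free sketch. It uses collections.Counter as a stand-in for value_counts(sort=False), since a Counter returns counts in first-seen order rather than code order. The `classes` list and the `codes` data are hypothetical and only illustrate the idea; they are not the real dataset.

```python
from collections import Counter

# Hypothetical stand-ins (assumptions, not the real data):
# `classes` mimics to.classes['ProductSize'], `codes` mimics the
# integer-encoded ProductSize column of the validation set.
classes = ['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
codes = [3, 3, 0, 5, 0, 1, 3, 5, 0]

# Counter keeps keys in first-seen order, so the "bars" are not in code order.
counts = Counter(codes)
bar_codes = list(counts.keys())
bar_heights = list(counts.values())

# Buggy labelling: assumes bar i holds code i.
positional_labels = classes[:len(bar_codes)]

# Fixed labelling: look up each bar's label from that bar's own code.
lookup_labels = [classes[code] for code in bar_codes]

buggy = dict(zip(positional_labels, bar_heights))
fixed = dict(zip(lookup_labels, bar_heights))

print("buggy:", buggy)
print("fixed:", fixed)
```

In this toy data 'Large' occurs once, but the positional labelling attaches it to a bar of height 3, which is the same kind of error as in the book's plot.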

Example fix

Here's some code that labels the y-axis correctly by looking up each bar's label from its index value in the counts, so the labels follow the order of the bars:

counts = valid_xs_final['ProductSize'].value_counts(sort=False)
p = counts.plot.barh()
c = [to.classes['ProductSize'][i] for i in counts.index.values]
plt.yticks(range(len(c)), c)

image

rigdern commented 1 year ago

Looks like a fix was submitted in pull request #410.

jhanschoo commented 4 months ago

I can confirm this issue; I ran into it while writing my own notes. My fix was to sort the counts by index before plotting, so the bars are in the same order as the labels:

p = valid_xs_final['ProductSize'].value_counts(sort=False).sort_index().plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);