ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0
25.14k stars, 12.91k forks

What does "oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]" mean? #609

Closed by minertom 3 years ago

minertom commented 3 years ago

Hi, I am not completely new to Python, but this construct is a little bit beyond what I have encountered before. It seems to me that the statement oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"] is some sort of "list comprehension", but I don't understand it. It is apparently a recursive function.

The only guess that I can come up with is that this statement replaces text within the input data. For example, in the CSV file that is used, oecd_bli, "Inequality" is replaced with "Total" and "Indicator" is replaced with "Life expectancy".

What is the term that is used for this kind of Python function?

Thank you,
Tom

ageron commented 3 years ago

Hi @minertom ,

Thanks for your question.

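This is a special indexing syntax called boolean indexing (or boolean masking). It works with Pandas DataFrames, NumPy arrays, TensorFlow tensors and objects from a few other libraries. It is neither a list comprehension nor a recursive function: the expression inside the brackets is evaluated first, and its result is then used as the index.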

Here's a simple example:

import numpy as np

a = np.array([10, 20, 30, 40, 50])
i = np.array([False, True, False, True, True]) # is True for every item we want, and otherwise False
print(a[i]) # prints [20 40 50]

Now suppose I only want to keep the even numbers in an array. Here's one way to do it:

a = np.array([1, 3, 4, 8, 2, 5, 4])
i = (a % 2 == 0) # this will be equal to array([False, False,  True,  True,  True, False,  True])
print(a[i]) # prints [4 8 2 4]
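
Since the question mentions list comprehensions, here is a rough comparison: the boolean mask selects the same items a list comprehension would, just in one vectorized step. This is only a sketch, and the names mask, evens_masked and evens_listcomp are made up for this illustration:

```python
import numpy as np

a = np.array([1, 3, 4, 8, 2, 5, 4])

# Boolean indexing: build a mask first, then use it as the index
mask = (a % 2 == 0)
evens_masked = a[mask]

# The same selection written as a plain Python list comprehension
evens_listcomp = [int(x) for x in a if x % 2 == 0]

print(evens_masked.tolist())  # [4, 8, 2, 4]
print(evens_listcomp)         # [4, 8, 2, 4]
```

The difference is that a[mask] returns a NumPy array and does the filtering inside NumPy, while the list comprehension loops in Python and returns a list.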

Now let's look at the line that confused you:

oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]

First, note that oecd_bli["INEQUALITY"]=="TOT" is a pandas Series equal to True everywhere the "INEQUALITY" feature is equal to "TOT". So oecd_bli[oecd_bli["INEQUALITY"]=="TOT"] is a new pandas DataFrame containing only the rows where the "INEQUALITY" feature is equal to "TOT".

Here's a simplified example:

import pandas as pd

oecd_bli = pd.DataFrame({
    "INEQUALITY": ["a", "b", "TOT", "TOT", "c", "TOT"],
    "Other": [10, 20, 30, 40, 50, 60]
})

print(oecd_bli["INEQUALITY"]=="TOT")
# prints this Pandas Series:
# 0    False
# 1    False
# 2     True 
# 3     True
# 4    False
# 5     True
# Name: INEQUALITY, dtype: bool

oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]

print(oecd_bli)
# prints:
#  INEQUALITY  Other
# 2        TOT      30
# 3        TOT      40
# 5        TOT      60
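
To address the guess in the question: this line filters rows, it does not replace any text. Here is a small sketch using the same toy data as above (the names df, mask and filtered are just for illustration). It suggests that the original DataFrame is left untouched, that the surviving rows keep their original index labels, and that .loc with the same mask selects the same rows:

```python
import pandas as pd

df = pd.DataFrame({
    "INEQUALITY": ["a", "b", "TOT", "TOT", "c", "TOT"],
    "Other": [10, 20, 30, 40, 50, 60]
})

mask = df["INEQUALITY"] == "TOT"  # boolean Series, one value per row
filtered = df[mask]               # new DataFrame with only the True rows

print(len(df), len(filtered))         # 6 3: the original still has all its rows
print(filtered.index.tolist())        # [2, 3, 5]: original row labels are kept
print(filtered.equals(df.loc[mask]))  # True: .loc with the mask gives the same rows
```

In the original line, the filtered result is assigned back to the same name oecd_bli, which is why it can look as if the data was modified in place.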

I hope this is clear. For more info on advanced array indexing, check out NumPy's docs and pandas' docs. You can also check out the tutorial notebooks I made.

Hope this helps.

AmeyaSaonerkar commented 3 years ago

Crystal clear explanation. Thank you very much, sir "Aurélien Géron".

Kay0031 commented 2 years ago

Thank you, sir. Thanks for your explanation. @AmeyaSaonerkar