ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

NameError: name 'prepare_country_stats' is not defined #33

Closed Jai-GAY closed 7 years ago

Jai-GAY commented 7 years ago

Hi,

Does anyone know a workaround?

On page 43/564, in Example 1-1 (Training and running a linear model using Scikit-Learn), how do I overcome this error?

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)

Traceback (most recent call last):
  File ".\example_1.1.py", line 12, in <module>
    country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
NameError: name 'prepare_country_stats' is not defined

Hands-On Machine Learning with Scikit-Learn and TensorFlow

pprivulet commented 7 years ago

01_the_machine_learning_landscape.ipynb: "def prepare_country_stats(oecd_bli, gdp_per_capita):\n",

The function is defined in 01_the_machine_learning_landscape.ipynb. Good luck!

ageron commented 7 years ago

As @pprivulet pointed out (thanks!), the function is defined in the notebook. I left some code out of the book when there was really nothing interesting or machine learning specific to it. Things like plotting an image, etc. If you get stuck at any point, check out the corresponding notebook, and don't hesitate to ping me, I'll be happy to help.

Cheers

Jai-GAY commented 7 years ago

Yes, thanks. I need to run it from the Jupyter notebook.

ankursworld commented 7 years ago

Hi @ageron, I tried to follow Example 1-1 from your book, and also tried to append the function from the Jupyter notebook to it. However, it still causes errors. I think it is a fair expectation to be able to follow the code in the book without correcting it. Could you please see whether the code in the book can be updated to be self-sufficient? Alternatively, you could refer readers to the notebook and ask them to run that rather than the code in the book. Thanks, Ankur

McCarthyORAL commented 6 years ago

I hope this can be helpful http://www.cnblogs.com/yaoz/p/6858417.html

ageron commented 6 years ago

Hi everyone,

Apparently this missing code is causing some confusion, I'm sorry about that. It is only there to "whet your appetite", to give you a feel of what's coming next, not to be actually executed. But I understand that some readers might want to run it as is. If you really want to execute it, then here's a prepare_country_stats() function you can use:

def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

Just add this function at the beginning of the code, and run the program in the directory that contains the data files (oecd_bli_2015.csv and gdp_per_capita.csv) and you should be fine (except that you must add an import sklearn.linear_model, at least in recent versions of Scikit-Learn).

As you can see, it's a long and boring function that prepares the data to have a nice and clean matrix in the end. Just Pandas stuff, nothing special about it, and nothing interesting with regards to Machine Learning, which is why I didn't want to include it in the book. In general, I avoid including every single line of code in the book, for readability, to keep it short and focused on what matters most, but hopefully, from chapter 2 onwards, you should be able to follow along in the Jupyter notebook very easily.
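
To make the pivot-and-merge steps above concrete, here is a minimal sketch on toy data. The country names and all the numbers are made up for illustration, and the helper name merge_country_stats is hypothetical. It also skips the index-based row removal, which only makes sense for the real 36-country table.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are invented).
oecd_bli = pd.DataFrame({
    "Country": ["Atlantis", "Atlantis", "Atlantis", "Borduria", "Borduria"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Air pollution",
                  "Life satisfaction", "Air pollution"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "TOT"],
    "Value": [7.0, 6.9, 10.0, 5.5, 20.0],
})
gdp_per_capita = pd.DataFrame({
    "Country": ["Atlantis", "Borduria", "Cascadia"],
    "2015": [50000.0, 12000.0, 30000.0],
})

def merge_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows that cover the total population.
    totals = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # One row per country, one column per indicator.
    wide = totals.pivot(index="Country", columns="Indicator", values="Value")
    # Rename the year column and index by country (without mutating the input).
    gdp = gdp_per_capita.rename(columns={"2015": "GDP per capita"}).set_index("Country")
    # Inner join on the country index: countries missing from either table are dropped.
    merged = pd.merge(left=wide, right=gdp, left_index=True, right_index=True)
    merged = merged.sort_values(by="GDP per capita")
    return merged[["GDP per capita", "Life satisfaction"]]

country_stats = merge_country_stats(oecd_bli, gdp_per_capita)
print(country_stats)
```

Unlike the function above, this sketch does not use inplace=True, so the input dataframes are left untouched and the function can be called twice on the same data.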

In the latest release, I added a footnote saying "The code assumes that prepare_country_stats() is already defined: it merges the GDP and life satisfaction data into a single Pandas dataframe." Perhaps that's not clear enough, though: I think I will change this to explicitly tell readers that if they want to run the code, they should do so in the Jupyter notebook which contains all the boring details (this is strongly suggested in the preface, but I know not everyone reads the preface, I certainly don't).

What do you think?

ageron commented 6 years ago

I replaced the footnote with this: "The prepare_country_stats() function's definition is not shown here (see this chapter's Jupyter notebook if you want all the gory details). It's just boring Pandas code that joins the life satisfaction data from the OECD with the GDP per capita data from the IMF."

I also updated the notebook to make the example 1-1's code stand out at the beginning, and I added the prepare_country_stats() function from my previous comment.

Thanks everyone for your very useful feedback! Hopefully, the book will get better and better. :)

saravanakumarjsk commented 6 years ago

Thanks, that was very helpful

ankursworld commented 6 years ago

Thank you for your clarification and help!

Thanks Ankur

McCarthyORAL commented 6 years ago

Hello Aurélien Géron,

When fetching a dataset from a website for machine learning, could you please help me with a Python script that first pulls the data from the website, and a second script that incrementally updates the dataset daily?

Thanks Richard

ageron commented 6 years ago

Hi Richard,

The download part does not have to be in Python: it might be simpler to write a cron job that uses wget or curl to download the file. That said, there is some example code in the notebook for chapter 2: https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
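
If you would rather stay in Python, here is a minimal sketch of the same idea. It re-downloads the whole file when the local copy is older than a day, so it is a simple refresh rather than a true incremental update. The function name refresh_if_stale and its parameters are hypothetical, and the fetch argument exists only so the download step can be swapped out.

```python
import os
import time
from urllib.request import urlretrieve

ONE_DAY = 24 * 60 * 60  # seconds

def refresh_if_stale(url, path, max_age=ONE_DAY, fetch=urlretrieve):
    """Download url to path unless a recent enough copy already exists.

    Returns True if a download happened, False if the local copy was kept.
    """
    if os.path.exists(path):
        age = time.time() - os.path.getmtime(path)
        if age < max_age:
            return False  # the local copy is still fresh
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create the target directory if needed
    fetch(url, path)
    return True
```

You could call this at the top of a script, or schedule the script itself to run once a day.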

Hope this helps, Aurélien

sor3765 commented 5 years ago

I copied and pasted Example 1-1, then added the function definition, but I keep getting KeyError: 'Country' from line 12: gdp_per_capita.set_index("Country", inplace=True)

How can I fix this error? I tried it in both Jupyter Notebook and Visual Studio, and both end with the same error.

ageron commented 5 years ago

Hi @sor3765 , Thanks for your question. Perhaps the problem comes from the data? Are you using oecd_bli_2015.csv and gdp_per_capita.csv which are available in the datasets/lifesat directory or did you try to download the latest data from the OECD and IMF websites? Are you sure you did not modify the code in any way? Perhaps you should download it again, just to be sure? If you copy/pasted the code, perhaps the indentation got modified?
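
A KeyError: 'Country' raised by set_index() means the dataframe has no column with exactly that name, so a quick check of the column names right after loading can narrow things down. The sketch below uses a hypothetical helper, missing_columns, and a made-up tab-separated sample. It only illustrates one way a header can end up misread (the wrong delimiter), and is not a diagnosis of the files used in this thread.

```python
import io
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]

# Made-up tab-separated sample standing in for a downloaded file.
sample = "Country\t2015\nAtlantis\t50000\nBorduria\t12000\n"

wrong = pd.read_csv(io.StringIO(sample))                  # default comma delimiter
right = pd.read_csv(io.StringIO(sample), delimiter="\t")  # tab delimiter

# With the wrong delimiter the whole header is read as a single column name.
print(list(wrong.columns), missing_columns(wrong, ["Country", "2015"]))
print(list(right.columns), missing_columns(right, ["Country", "2015"]))
```

Printing list(df.columns) on your own dataframes just before the failing line should show whether the expected names are really there.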

sor3765 commented 5 years ago

I guess I may have downloaded the wrong file. It wasn't clear to me where exactly I could get the file, so I tried to get it from someone else's GitHub. I can try again tomorrow morning.

ashokrajv commented 5 years ago

Hi sor3765, I am facing the same issue. I am running in a Kaggle kernel, and I downloaded the files from this public dataset: https://www.kaggle.com/abhilashanil/better-life-index-and-gross-domestic-product/kernels Any help? The complete error is in the attached KeyError_Country.txt. Thanks, Ashok

ageron commented 5 years ago

The files are available directly in this project, in the datasets/lifesat directory: https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/gdp_per_capita.csv https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/oecd_bli_2015.csv

ashokrajv commented 5 years ago

Thank you, Aurélien Géron, for the quick reply. I tried uploading both files manually. I was able to upload gdp_per_capita.csv, but not oecd_bli_2015.csv. An existing file in a Kaggle dataset is blocking me, but that file doesn't belong to me. How can I handle this? Screenshot attached.

Appreciate your help!

Thanks, Ashok

ageron commented 5 years ago

Hi @ashokrajv , I have never run into this issue, sorry. It seems that Kaggle wants to avoid data duplication, so they're asking you to reuse the file from the other dataset. Not sure how this is done in Kaggle, I recommend you ask Kaggle. Alternatively, you can just update the notebook to download the files instead of using the ones in the project:

import os
from urllib.request import urlretrieve

URL = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/"
datapath = os.path.join("datasets", "lifesat", "")
os.makedirs(datapath, exist_ok=True)  # make sure the target directory exists
urlretrieve(URL + "gdp_per_capita.csv", datapath + "gdp_per_capita.csv")
urlretrieve(URL + "oecd_bli_2015.csv", datapath + "oecd_bli_2015.csv")

Then you can load them using pd.read_csv(), as shown in the notebook.

Hope this helps.

ashokrajv commented 5 years ago

Thank you @ageron, I shall try this option. Thanks, Ashok

ashokrajv commented 5 years ago

Hi Aurélien Géron, I tried using the urlretrieve method as shown below. I replaced this line:

datapath = os.path.join("datasets", "lifesat", "")

with the code you gave to read directly from the URL:

from urllib.request import urlretrieve
URL = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/"
datapath = os.path.join("datasets", "lifesat", "")
urlretrieve(URL + "gdp_per_capita.csv", datapath + "gdp_per_capita.csv")
urlretrieve(URL + "oecd_bli_2015.csv", datapath + "oecd_bli_2015.csv")

I am getting an error in the urlretrieve function: URLError: <urlopen error [Errno -3] Temporary failure in name resolution>. The full error is in the attached URLError.txt. Any suggestions, please? Thanks, Ashok

ageron commented 5 years ago

Wow, that's bad luck! You seem to have DNS issues. Name resolution means converting the domain name (raw.githubusercontent.com) to an IP address (151.101.8.133). You could try again: as the message says, it's probably a "temporary failure". If the problem persists, check your network settings or your ISP, something's fishy. Or you could just download the files manually by visiting https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/gdp_per_capita.csv and https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/lifesat/oecd_bli_2015.csv and selecting File > Save Page As... Hope this helps.
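
If you prefer to automate the "try again" part, here is a minimal sketch of a retry wrapper around urlretrieve. The function name fetch_with_retry and its parameters are hypothetical, the fetch and sleep arguments exist only to make the helper easy to test, and the wait times are arbitrary.

```python
import time
from urllib.error import URLError
from urllib.request import urlretrieve

def fetch_with_retry(url, path, attempts=3, delay=2.0,
                     fetch=urlretrieve, sleep=time.sleep):
    """Call fetch(url, path), retrying on URLError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url, path)
        except URLError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            sleep(delay * attempt)  # wait a little longer after each failure
```

It can be called just like urlretrieve, for example fetch_with_retry(URL + "gdp_per_capita.csv", datapath + "gdp_per_capita.csv"). If name resolution keeps failing, the last URLError is raised, so a persistent network problem is not hidden.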

rkuma107 commented 5 years ago

Thanks Ageron. I am learning ML for the first time, and this forum and your active participation are very helpful.

rkuma107 commented 5 years ago

Hi ageron, after following your first example, I got excited and tested a simple linear regression model to compute the square of a number.

import numpy as np
import sklearn.linear_model

x = np.array([[1],[2], [3], [4], [5],[6], [7], [8], [9], [10], [11], [12], [13], [14], [15],[20],[25],[30]])
y = np.array([[1],[4], [9], [16], [25], [36], [49], [64], [81], [100], [121], [144], [169], [196], [225],[400],[625],[900]])
m2 = sklearn.linear_model.LinearRegression()
m2.fit(x,y)
m2.predict([[10]]) # array([[151.49643705]])
m2.predict([[35]]) # array([[881.60332542]])

In such a simple case, why is the linear regression model not able to give the correct squares for 10 and 35? My apologies if I am not supposed to ask such a question here.

ageron commented 5 years ago

Hi @rkuma107 ,

A linear regression model assumes that the data you are trying to model is linear. In other words, it assumes that y = w1×x1 + w2×x2 + ... + wn×xn + b (plus some Gaussian noise), and it tries to find the coefficients w1 to wn and the bias term b. In your case there is a single input feature x1, so the model simplifies to: y = w1×x1 + b

However, in your example the data is not linear, it is quadratic, so the linear model makes inaccurate predictions. You can see this clearly in the following plot:

[image: plot of the data points and the fitted regression line]

You can get this plot by running the following code in Jupyter (or Colab):

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model

#x = np.array([[1],[2], [3], [4], [5],[6], [7], [8], [9], [10], [11], [12], [13], [14], [15],[20],[25],[30]])
#y = np.array([[1],[4], [9], [16], [25], [36], [49], [64], [81], [100], [121], [144], [169], [196], [225],[400],[625],[900]])
x = np.array([list(range(1, 15)) + list(range(15, 31, 5))]).reshape(-1, 1)
y = x ** 2
m = sklearn.linear_model.LinearRegression()
m.fit(x,y)
m.predict([[10]]) # array([[151.49643705]])
m.predict([[35]]) # array([[881.60332542]])

plt.plot(x, y, "o")
plt.xlabel("x", fontsize=16)
plt.ylabel("y", rotation=0, fontsize=16)
xs = np.linspace(0, 30, 100).reshape(-1, 1)
ys = m.predict(xs)
plt.plot(xs, ys)

Note that I defined x and y a bit differently: you can use range() to avoid typing long lists of integers, and you can run operations on NumPy arrays directly, for example y = x**2. Hope this helps.
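
One common way to handle this kind of data while keeping a linear model is to add x squared as a second input feature. The sketch below assumes the same x and y as above. The names x_poly and predict_square are hypothetical, and this is only one possible approach.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array(list(range(1, 15)) + list(range(15, 31, 5))).reshape(-1, 1)
y = x ** 2

# Add x squared as a second feature: the model is still linear in its inputs,
# i.e. y = w1*x + w2*x^2 + b.
x_poly = np.hstack([x, x ** 2])

model = LinearRegression()
model.fit(x_poly, y)

def predict_square(value):
    # Build the same two features (value, value squared) for a new input.
    features = np.array([[value, value ** 2]])
    return model.predict(features)[0, 0]

print(predict_square(10))
print(predict_square(35))
```

Since the data is exactly quadratic, the two predictions should come out very close to 100 and 1225, up to floating-point error.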